Basic Robotcop Setup
Here is a simple Robotcop setup which should provide excellent protection
for your site with a minimum of work or impact. Follow these steps, then
kick back and watch your web logs to see what gets caught.
Configuring Robotcop for Apache
Robotcop is inactive until you enable it in your Apache httpd.conf file
by setting the RobotcopMode directive, which tells it how to handle
intercepted spiders. Below is documentation for all supported Apache
directives.
Basic httpd.conf for Robotcop
Here is a minimal set of directives that enables the suggested setup
above when added to the end of your httpd.conf file.
RobotcopMode ban
Robotcop Handlers for Apache
Robotcop can be configured to intercept all requests for a specific directory.
Any client requesting files in that directory will be added to the intercept
list and blocked. This is useful for creating trap directories which spiders
are directed into via hidden links. Note that this directive is not needed to
catch spiders if they read your robots.txt file and this directory is listed as
Disallowed there.
Make sure any trap directory is marked as Disallowed in the robots.txt file so that legitimate spiders know to avoid the directory! Failure to do so may result in Google ranking your site #1 under searches for "twit". :-)
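One way to double-check the robots.txt rule before pointing hidden links at a trap directory is to parse the file with Python's standard `urllib.robotparser`. The following is only a sketch: the helper name `trap_is_disallowed` and the sample robots.txt contents are illustrative assumptions, and `/trapdir` is simply the example location used in this document.

```python
from urllib.robotparser import RobotFileParser


def trap_is_disallowed(robots_txt: str, trap_path: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text forbids `agent` from fetching `trap_path`."""
    parser = RobotFileParser()
    # parse() takes the file contents as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, trap_path)


# Hypothetical robots.txt contents that mark the trap directory as off limits.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /trapdir/
"""
```

Calling `trap_is_disallowed(EXAMPLE_ROBOTS, "/trapdir/index.html")` should report that the trap is protected, while an ordinary page such as `/index.html` should remain fetchable.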
<Location /trapdir>
Robotcop has a second handler, which displays a report on its configuration and lets you see its current status, such as blocked IP addresses, the list of evil User-agents, and the parsed robots.txt file in use for the site. You can use the report page to "forgive" an IP that has been mistakenly blocked, or look up the IP at SamSpade.org for further investigation of harvesters.
<Location /reportdir>
Robotcop Apache Directives
RobotcopMode directive
Syntax: RobotcopMode off|ban|fake|tarpit|custom
Default: RobotcopMode off
Context: server config, virtual host
Configure Robotcop behavior when handling misbehaving spiders. The default mode is "off", which means that Robotcop is disabled and is not providing any protection for the site. The "ban" mode results in the spider receiving a 403 Forbidden message for all further requests to the site. The "fake" mode tells Robotcop to dynamically generate an infinite series of pages full of fake e-mail addresses for poisoning the databases of e-mail address harvesters. The "tarpit" mode will cause Robotcop to generate a series of dull pages which are fed to the spider very, very slowly. The "custom" mode will have Robotcop use the intercept mode implemented in custom_arrest.c. This makes it very easy for users to add their own sadistic intercept methods.

RobotcopExpire directive
Syntax: RobotcopExpire [minutes]
Default: RobotcopExpire 60
Context: server config, virtual host
Tells Robotcop how long it should consider a spider IP address as banned before forgiving the address. Spider activity at the address resets the timer, so as long as the IP is active it will not be forgiven. Setting this value too large may be dangerous, as the IP address may be recycled and used by someone else!

RobotcopEvilAgent directive
Syntax: RobotcopEvilAgent [user-agent]
Context: server config, virtual host
Add a User-agent to be intercepted immediately. Robotcop already includes a list of most known User-Agent strings for known e-mail harvesters. The agent string can be given as just the agent name, such as "EvilSpider", or the full User-Agent string including version, such as "Mozilla/1.0 (Win32; Evilspider)". This lets you catch spiders pretending to be other agents.

RobotcopRobotsFile directive
Syntax: RobotcopRobotsFile [path-to-robots-file]
Context: server config, virtual host
The full path to the robots.txt file for this server or virtual host. This file will be parsed and consulted to make sure spiders that read it follow the rules listed there. If this directive is not included, Robotcop will not be able to provide nearly as much protection, so be sure to add it!

RobotcopRobotsExpire directive
Syntax: RobotcopRobotsExpire [minutes]
Default: RobotcopRobotsExpire 1440
Context: server config, virtual host
Tells Robotcop how long it should remember the IP address of clients which have requested the robots.txt file. During this period, if a request comes from that IP address which violates a rule in the robots.txt file, that IP address will be added to the intercept list.

RobotcopFilterLog directive
Syntax: RobotcopFilterLog [on|off]
Default: RobotcopFilterLog on
Context: server config, virtual host
Tells Robotcop to filter out hits from caught spiders when logging. This way your access log doesn't get cluttered with hits lost in trap directories. Robotcop will still log a single entry for each spider it catches in your error log.
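
The timer-reset rule described for RobotcopExpire can be easier to reason about with a small model. The sketch below is a toy Python illustration of that rule only; it is not Robotcop's actual implementation, and the class name `BanTable` and its methods are hypothetical. Times are plain seconds passed in by the caller.

```python
from typing import Dict


class BanTable:
    """Toy model of the documented RobotcopExpire behavior (not Robotcop's own code)."""

    def __init__(self, expire_minutes: float = 60) -> None:
        # 60 minutes mirrors the documented default for RobotcopExpire.
        self.expire_seconds = expire_minutes * 60
        # Maps a banned IP to the time of its most recent activity.
        self._last_seen: Dict[str, float] = {}

    def ban(self, ip: str, now: float) -> None:
        """Add an IP to the intercept list at time `now` (in seconds)."""
        self._last_seen[ip] = now

    def is_banned(self, ip: str, now: float) -> bool:
        """Handle a request from `ip` at time `now`; activity while banned resets the timer."""
        last = self._last_seen.get(ip)
        if last is None:
            return False
        if now - last >= self.expire_seconds:
            # The address stayed quiet for the whole period, so it is forgiven.
            del self._last_seen[ip]
            return False
        # The address is still active, so the countdown starts over.
        self._last_seen[ip] = now
        return True
```

In this model, an address that keeps sending requests is never forgiven, while one that stays quiet for the full period is released. That is why an overly large expiry value risks blocking whoever later inherits a recycled address.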