Robotcop
Basic Robotcop Setup

Here is a simple Robotcop setup which should provide excellent protection for your site with a minimum of work or impact. Follow these steps, then kick back and watch your web logs to see what gets caught.
  1. In your httpd.conf, set RobotcopMode to "ban" so that spiders are blocked once they are caught.
  2. Tell Robotcop where your robots.txt file is with the RobotcopRobotsFile Apache directive.
  3. Add a rule to your robots.txt file such as "Disallow: /mytrapdir" (robots.txt rules need the colon after the field name).
  4. Add a hidden link to this directory near the top of your web pages.
Configuring Robotcop for Apache

Robotcop is inactive until it is explicitly enabled in your Apache httpd.conf file by setting the RobotcopMode directive, which tells it how to handle intercepted spiders. All supported Apache directives are documented below.
Basic httpd.conf for Robotcop

Here is a minimal set of directives that will enable the suggested setup above if added to the end of your httpd.conf file.

RobotcopMode ban
RobotcopRobotsFile /path/to/your/apache/htdocs/robots.txt
<Location /mytrapdir>
    SetHandler robotcop-arrest
</Location>

Robotcop Handlers for Apache

Robotcop can be configured to intercept all requests for a specific directory. Any client requesting files in that directory will be added to the intercept list and blocked. This is useful for creating trap directories which spiders are directed into via hidden links. Note that this handler is not needed to catch spiders that read your robots.txt file, provided the directory is listed as Disallowed there.

Make sure any trap directory is marked as Disallowed in the robots.txt file so that legitimate spiders know to avoid the directory! Failure to do so may result in Google ranking your site #1 under searches for "twit". :-)

<Location /trapdir>
    SetHandler robotcop-arrest
</Location>

Robotcop has a second handler which displays a report on its configuration and lets you see its current status, such as blocked IP addresses, the list of evil User-agents, and the parsed robots.txt file in use for the site. You can use the report page to "forgive" an IP which has been mistakenly blocked, or look up the IP at SamSpade.org for further investigation of harvesters.

<Location /reportdir>
    SetHandler robotcop-report
</Location>
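
Because the report page can forgive blocked addresses, you may want to limit who can reach it. The variant below is only a sketch and is not taken from the Robotcop distribution. It assumes an older Apache with mod_access loaded, and the 127.0.0.1 address is a placeholder for whatever host you administer the server from.

# Assumption: mod_access (Order/Deny/Allow) is available. Newer Apache
# releases use "Require ip" from mod_authz_host instead.
<Location /reportdir>
    SetHandler robotcop-report
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>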

Robotcop Apache Directives

RobotcopMode directive

Syntax: RobotcopMode off|ban|fake|tarpit|custom
Default: RobotcopMode off
Context: server config, virtual host

Configures how Robotcop handles misbehaving spiders. The default mode is "off", which means that Robotcop is disabled and is not providing any protection for the site.

The "ban" mode results in the spider receiving a 403 Forbidden message for all further requests to the site.

The "fake" mode tells Robotcop to dynamically generate an infinite series of pages full of fake e-mail addresses for poisoning the databases of e-mail address harvesters.

The "tarpit" mode will cause Robotcop to generate a series of dull pages which are fed to the spider very, very slowly.

The "custom" mode will have Robotcop use the intercept mode implemented in custom_arrest.c. This makes it very easy for users to add their own sadistic intercept methods.
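
Since RobotcopMode is valid in both server config and virtual host context, different sites on one server can presumably run different modes. The following is a minimal sketch, assuming that a setting inside a virtual host takes precedence over the server-wide one (this has not been confirmed against the module). The host name and path are placeholders.

# Server-wide default: caught spiders receive 403 Forbidden.
RobotcopMode ban

<VirtualHost *:80>
    ServerName www.example.com
    # Assumption: this replaces the server-wide mode for this host only.
    RobotcopMode tarpit
    RobotcopRobotsFile /path/to/example/htdocs/robots.txt
</VirtualHost>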

RobotcopExpire directive

Syntax: RobotcopExpire [minutes]
Default: RobotcopExpire 60
Context: server config, virtual host

Tells Robotcop how long it should consider a spider IP address as banned before forgiving the address. Spider activity at the address resets the timer, so the IP will not be forgiven while it remains active. Setting this value too large may be dangerous, as the IP address may be recycled and used by someone else!

RobotcopEvilAgent directive

Syntax: RobotcopEvilAgent [user-agent]
Context: server config, virtual host

Adds a User-agent to be intercepted immediately. Robotcop already includes a list of User-Agent strings for most known e-mail harvesters. The agent string can be given as just the agent name, such as "EvilSpider", or as the full User-Agent string including version, such as "Mozilla/1.0 (Win32; EvilSpider)". This lets you catch spiders pretending to be other agents.
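
For illustration, the two forms might look like this in httpd.conf. The agent names are invented, and it is assumed that the directive can be repeated to add several agents. A value containing spaces has to be quoted so that Apache passes it to the directive as a single argument.

# Match on the agent name alone (hypothetical name).
RobotcopEvilAgent EvilSpider
# Match on a full User-Agent string; the quotes keep it one argument.
RobotcopEvilAgent "Mozilla/1.0 (Win32; EvilSpider)"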

RobotcopRobotsFile directive

Syntax: RobotcopRobotsFile [path-to-robots-file]
Context: server config, virtual host

The full path to the robots.txt file for this server or virtual host. This file will be parsed and consulted to make sure spiders that read it follow the rules listed there. If this directive is not included, Robotcop will not be able to provide nearly as much protection, so be sure to add it!

RobotcopRobotsExpire directive

Syntax: RobotcopRobotsExpire [minutes]
Default: RobotcopRobotsExpire 1440
Context: server config, virtual host

Tells Robotcop how long it should remember the IP address of clients which have requested the robots.txt file. During this period, if a request from that IP address violates a rule in the robots.txt file, the address will be added to the intercept list.
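
As a sketch of how the two timers, RobotcopExpire and RobotcopRobotsExpire, might be tuned together, both take a value in minutes. The numbers below are arbitrary examples, not recommendations.

# Forgive a banned IP after 30 minutes without spider activity (default: 60).
RobotcopExpire 30
# Remember robots.txt readers for 12 hours: 12 * 60 = 720 minutes
# (the default of 1440 is 24 hours).
RobotcopRobotsExpire 720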

RobotcopFilterLog directive

Syntax: RobotcopFilterLog [on|off]
Default: RobotcopFilterLog on
Context: server config, virtual host

Tells Robotcop to filter out hits from caught spiders when logging. This way your access log doesn't get cluttered with hits lost in trap directories. Robotcop will still log a single entry for each spider it catches in your error log.
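
If you would rather see every hit from a caught spider in the access log, for instance while checking that a new trap directory works, filtering can be switched off. This one-line sketch uses only the syntax documented above.

# Keep hits from caught spiders in the access log.
RobotcopFilterLog off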