Robotcop
Project History

Robotcop grew out of discussions on Evolt.org and Slashdot.org that made the need for such software clear. It was written by three software developers in Northern California.

We looked at existing solutions to the problem of misbehaving spiders and felt that an Apache module would be a valuable addition, in particular because most of the existing tools are CGI scripts, which cannot review every request to the site. We think Robotcop is currently the best way to protect a website against misbehaving spiders.

Technical Overview

Robotcop is a module written in C that hooks into the access control API of the webserver. Every request to the site is subject to a number of checks before Robotcop allows it to proceed. If a check fails, Robotcop takes control of the request to counter-attack or ban the spider, and the spider's IP address is added to an intercept list so that further requests from that address are caught immediately. The IP address is removed from the list once it has sent no requests for a configurable period.
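
The intercept list behaviour described above can be illustrated with a minimal sketch. This is not Robotcop's actual code: the names intercept_list, intercept_add and intercept_check, the fixed-size table, and the choice to refresh an entry's timestamp on every blocked request are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define MAX_ENTRIES 64
#define IP_LEN 46 /* room for a textual IPv6 address plus NUL */

/* Hypothetical types; the real module's data structures are not documented here. */
typedef struct {
    char   ip[IP_LEN];
    time_t last_seen; /* time of the most recent request from this IP */
    int    in_use;
} intercept_entry;

typedef struct {
    intercept_entry entries[MAX_ENTRIES];
    time_t idle_timeout; /* seconds of silence before an entry is dropped */
} intercept_list;

void intercept_init(intercept_list *l, time_t idle_timeout)
{
    assert(l != NULL);
    memset(l, 0, sizeof *l);
    l->idle_timeout = idle_timeout;
}

/* Free the slot if its IP has been silent for idle_timeout; return 1 if still live. */
static int entry_live(const intercept_list *l, intercept_entry *e, time_t now)
{
    if (e->in_use && now - e->last_seen >= l->idle_timeout)
        e->in_use = 0;
    return e->in_use;
}

/* Add ip, or refresh it if already listed. Returns 0 on success, -1 if the table is full. */
int intercept_add(intercept_list *l, const char *ip, time_t now)
{
    intercept_entry *slot = NULL;
    assert(l != NULL && ip != NULL);
    for (int i = 0; i < MAX_ENTRIES; i++) {
        intercept_entry *e = &l->entries[i];
        if (entry_live(l, e, now) && strcmp(e->ip, ip) == 0) {
            slot = e; /* already listed: reuse this slot */
            break;
        }
        if (!e->in_use && slot == NULL)
            slot = e; /* remember the first free slot */
    }
    if (slot == NULL)
        return -1;
    strncpy(slot->ip, ip, IP_LEN - 1);
    slot->ip[IP_LEN - 1] = '\0';
    slot->last_seen = now;
    slot->in_use = 1;
    return 0;
}

/* Return 1 if ip is on the list (and note the new request), 0 otherwise. */
int intercept_check(intercept_list *l, const char *ip, time_t now)
{
    assert(l != NULL && ip != NULL);
    for (int i = 0; i < MAX_ENTRIES; i++) {
        intercept_entry *e = &l->entries[i];
        if (entry_live(l, e, now) && strcmp(e->ip, ip) == 0) {
            e->last_seen = now; /* the spider is still active, so keep it listed */
            return 1;
        }
    }
    return 0;
}
```

In this sketch a caller would run intercept_init once with the configured idle period, call intercept_add when a check fails, and call intercept_check at the start of every request. Passing the current time in as a parameter, rather than calling time() internally, keeps the expiry logic easy to test.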

These are the checks applied to all requests to the webserver protected by Robotcop:

  1. If the IP address of the request is on the intercept list, the request is blocked by Robotcop.
  2. If the request is for a directory which has been marked as a trap by the webmaster, the IP address is added to the intercept list and the request is blocked.
  3. If the request is from a User-Agent which matches a known evil spider, the IP address is added to the intercept list and the request is blocked.
  4. If the IP address has previously read the robots.txt file and this request violates a rule in that file, the IP address is added to the intercept list and the request is blocked.
  5. If the request is for the robots.txt file, the client IP address is remembered so that its later requests can be checked against the rules in that file. The request is allowed to continue as normal.
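
The order of the checks above can be sketched as a single function. This is an illustrative sketch only, not Robotcop's source: cop_state, cop_check and the helper names are hypothetical, trap directories and robots.txt rules are assumed to be simple path prefixes, User-Agent matching is assumed to be a substring test, and the intercept list here never expires entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_IPS 32
#define IP_LEN 46

typedef struct { char ips[MAX_IPS][IP_LEN]; size_t count; } ip_set;

/* Hypothetical configuration and state; all names are assumptions. */
typedef struct {
    const char *const *trap_dirs;  size_t n_traps;      /* trap path prefixes */
    const char *const *bad_agents; size_t n_agents;     /* User-Agent substrings */
    const char *const *disallowed; size_t n_disallowed; /* robots.txt Disallow prefixes */
    ip_set intercepted;  /* simplified intercept list, no expiry */
    ip_set read_robots;  /* IPs that have fetched robots.txt */
} cop_state;

typedef enum { COP_ALLOW, COP_BLOCK } cop_verdict;

static bool ip_set_has(const ip_set *s, const char *ip)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->ips[i], ip) == 0) return true;
    return false;
}

static void ip_set_add(ip_set *s, const char *ip)
{
    if (ip_set_has(s, ip) || s->count == MAX_IPS) return;
    strncpy(s->ips[s->count], ip, IP_LEN - 1);
    s->ips[s->count][IP_LEN - 1] = '\0';
    s->count++;
}

static bool any_prefix(const char *path, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strncmp(path, list[i], strlen(list[i])) == 0) return true;
    return false;
}

/* Put the IP on the intercept list and block the current request. */
static cop_verdict intercept(cop_state *st, const char *ip)
{
    ip_set_add(&st->intercepted, ip);
    return COP_BLOCK;
}

cop_verdict cop_check(cop_state *st, const char *ip, const char *path, const char *agent)
{
    assert(st != NULL && ip != NULL && path != NULL);

    /* 1. Already on the intercept list. */
    if (ip_set_has(&st->intercepted, ip)) return COP_BLOCK;

    /* 2. Request for a trap directory. */
    if (any_prefix(path, st->trap_dirs, st->n_traps)) return intercept(st, ip);

    /* 3. User-Agent matches a known bad spider (agent may be absent). */
    for (size_t i = 0; agent != NULL && i < st->n_agents; i++)
        if (strstr(agent, st->bad_agents[i]) != NULL) return intercept(st, ip);

    /* 4. Client read robots.txt earlier and now violates one of its rules. */
    if (ip_set_has(&st->read_robots, ip) &&
        any_prefix(path, st->disallowed, st->n_disallowed))
        return intercept(st, ip);

    /* 5. Remember clients that fetch robots.txt; the request itself proceeds. */
    if (strcmp(path, "/robots.txt") == 0) ip_set_add(&st->read_robots, ip);

    return COP_ALLOW;
}
```

Under these assumptions, a client that requests /robots.txt and then a disallowed path is blocked on the second request and on every request after it, while a client that never read robots.txt is not held to its rules. A real robots.txt groups rules by User-agent, which this sketch ignores for brevity.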

Project Direction

Robotcop is in active development. Have a feature suggestion? Want to help out with development or serious testing? Try out the software on your own website and join our mailing list. Here are the major improvements we intend to add to the software:

  1. Port to Apache 2.x
  2. Port to ISAPI (Zeus/IIS)
  3. Add intercept list synchronization so that Robotcop-protected servers can share their lists. This is required for load-distributed server farms.
  4. Add more intercept methods to cause even more problems for e-mail harvesting software!