9.1 Robot Exclusion
===================
It is extremely easy to make Wget wander aimlessly around a web site,
sucking up all the available data in the process. ‘wget -r SITE’, and
you’re set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a
reasonable rate (see the ‘--wait’ option), there’s not much of a
problem. The trouble is that Wget can’t tell the difference between the
smallest static page and the most demanding CGI. A site I know has a
section handled by a CGI Perl script that converts Info files to HTML on
the fly. The script is slow, but works well enough for human users
viewing an occasional Info file. However, when someone’s recursive Wget
download stumbles upon the index page that links to all the Info files
through the script, the system is brought to its knees without providing
anything useful to the user. (This task of converting Info files could
be done locally, and access to Info documentation for all installed GNU
software on a system is available from the ‘info’ command.)
To avoid this kind of accident, as well as to preserve privacy for
documents that need to be protected from well-behaved robots, the
concept of “robot exclusion” was invented. The idea is that the server
administrators and document authors can specify which portions of the
site they wish to protect from robots and which they will let robots
access.
The most popular mechanism, and the de facto standard supported by
all the major robots, is the “Robots Exclusion Standard” (RES) written
by Martijn Koster et al. in 1994. It specifies the format of a text
file containing directives that instruct the robots which URL paths to
avoid. To be found by the robots, the specifications must be placed in
‘/robots.txt’ in the server root, which the robots are expected to
download and parse.
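To give a feel for the file format, here is a minimal sketch that uses
Python's standard ‘urllib.robotparser’ module rather than Wget's own
parser. The sample rules, the ‘/cgi-bin/’ and ‘/private/’ paths, and
the ‘ExampleBot’ name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are illustrative only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
for path in ("/index.html", "/cgi-bin/info2html", "/private/notes.txt"):
    # "ExampleBot" is an assumed robot name; the "*" record applies to it.
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "excluded"
    print(path, verdict)
```

Each ‘Disallow’ line names a URL path prefix, so everything beneath
‘/cgi-bin/’ is off limits to a robot that honors the file.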
Although Wget is not a web robot in the strictest sense of the word,
it can download large parts of the site without the user intervening to
request each individual page. Because of that, Wget honors RES when
downloading recursively. For instance, when you issue:
     wget -r http://www.example.com/
First the index of ‘www.example.com’ will be downloaded. If Wget
finds that it wants to download more documents from that server, it will
request ‘http://www.example.com/robots.txt’ and, if found, use it for
further downloads. ‘robots.txt’ is loaded only once per server.
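The "once per server" behavior can be pictured as a small cache keyed
by server. The sketch below is an illustration in Python, not Wget's
implementation, and it does not reproduce the order in which Wget
fetches the index page and ‘robots.txt’. ‘RobotsCache’, ‘fake_fetch’
and ‘ExampleBot’ are hypothetical names, and the fetcher is a stand-in
so that the example runs without network access.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per server (hypothetical helper)."""

    def __init__(self, fetch_text):
        # fetch_text(url) -> str is supplied by the caller. It returns the
        # robots.txt body, or an empty string if the file is missing.
        self._fetch_text = fetch_text
        self._parsers = {}

    def allowed(self, user_agent, url):
        parts = urlsplit(url)
        server = (parts.scheme, parts.netloc)
        if server not in self._parsers:
            # First URL seen for this server: retrieve and parse the rules.
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self._fetch_text(robots_url).splitlines())
            self._parsers[server] = parser
        return self._parsers[server].can_fetch(user_agent, url)


# Stand-in fetcher that records every request it receives.
requests_made = []


def fake_fetch(robots_url):
    requests_made.append(robots_url)
    return "User-agent: *\nDisallow: /cgi-bin/\n"


cache = RobotsCache(fake_fetch)
first = cache.allowed("ExampleBot", "http://www.example.com/docs/a.html")
second = cache.allowed("ExampleBot", "http://www.example.com/cgi-bin/slow")
print(first, second, requests_made)
```

Both checks consult the same parsed rules, so the stand-in fetcher is
called only once for ‘www.example.com’.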
Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
<http://www.robotstxt.org/orig.html>. As of version 1.8, Wget has
supported the additional directives specified in the internet draft
‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”.
The draft, which as far as I know never made it to an RFC, is available
at <http://www.robotstxt.org/norobots-rfc.txt>.
This manual no longer includes the text of the Robot Exclusion
Standard.
The second, lesser-known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the ‘META’ tag, like this:
     <meta name="robots" content="nofollow">
This is explained in some detail at
<http://www.robotstxt.org/meta.html>. Wget supports this method of
robot exclusion in addition to the usual ‘/robots.txt’ exclusion.
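As a rough illustration of how a robot might notice this tag, here is
a minimal sketch built on Python's standard ‘html.parser’ module. It
is not how Wget itself parses HTML; ‘RobotsMetaParser’ and
‘links_may_be_followed’ are hypothetical names, and only the
‘nofollow’ directive shown above is checked.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a document carries a robots META tag with nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        directives = {item.strip().lower() for item in content.split(",")}
        if "nofollow" in directives:
            self.nofollow = True


def links_may_be_followed(html_text):
    """Return False if the document asks robots not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return not parser.nofollow


page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(links_may_be_followed(page))
```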
If you know what you are doing and really really wish to turn off the
robot exclusion, set the ‘robots’ variable to ‘off’ in your ‘.wgetrc’.
You can achieve the same effect from the command line using the ‘-e’
switch, e.g. ‘wget -e robots=off URL...’.