robots.txt Pattern Exclusion
robots.txt is a text file located in the root directory of a web site that adheres to the robots.txt standard. Taking the risk of repeating ourselves and generating a bit of “duplicate content,” here are three basic things to keep in mind regarding robots.txt:
❑ There can be only one robots.txt file.
❑ The proper location of robots.txt is in the root directory of a web site.
❑ robots.txt files located in subdirectories will not be accessed (or honored).
The official documentation of robots.txt is hosted at http://www.robotstxt.org/. There you can find a Frequently Asked Questions page, the complete reference, and a list of the names of the robots crawling the web.
If you peruse your logs, you will see that search engine spiders visit this particular file very frequently. This is because they make an effort not to crawl or index any files that are excluded by robots.txt, and want to keep a very fresh copy cached. robots.txt excludes URLs from a search engine on a very simple pattern-matching basis, and it is frequently the easier method to use when eliminating entire directories from a site, or, more specifically, when you want to exclude many URLs that start with the same characters.
Sometimes, for various internal reasons within a (usually large) company, it is not possible to gain access to modify this file in the root directory. In that case, so long as you have access to the source code of the part of the application in question, use the meta robots tag.
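For example, a page that should stay out of the index could include a tag such as the following in its <head> section (a minimal sketch; the noindex and nofollow directives shown here are common choices, but use whichever values fit the page):

<!-- Ask compliant robots not to index this page or follow its links -->
<meta name="robots" content="noindex,nofollow">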
A robots.txt file includes User-agent specifications, which define your exclusion targets, and Disallow entries for one or more URLs you want to exclude for those targets. Lines in robots.txt that start with # are comments, and are ignored.
The following robots.txt file, placed in the root folder of your site, would not permit any robots (*) to access any files on the site:
# Forbid all robots from browsing your site
User-agent: *
Disallow: /
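Along the same lines, robots.txt is handy for keeping robots out of entire directories, because a Disallow rule matches every URL that begins with the given prefix. Here is a brief sketch, using hypothetical /temp/ and /print/ directories:

# Keep all robots out of the (hypothetical) /temp/ and /print/ directories
User-agent: *
Disallow: /temp/
Disallow: /print/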
robots.txt is not a form of security! It does not prevent access to any files. It does stop a search engine from indexing the content, and therefore prevents users from navigating to those particular resources via a search engine results page. However, users could still access the pages by navigating directly to them. Also, robots.txt itself is a public resource, and anyone who wants to peruse it can do so by pointing their browser to /robots.txt. If anything, using it for “security” only makes those resources more obvious to potential hackers. To protect content, you should use the traditional methods of authenticating users and authorizing them to access the resources of your site.