The next example disallows any URLs that start with /directory from being indexed by Google:
# Disallow googlebot from indexing anything that starts with /directory
User-agent: googlebot
Disallow: /directory
googlebot is Google’s user-agent name. It is useful to think of each Disallow as matching prefixes, not files or URLs. Notably, /directory.html would also match that rule and be excluded, because /directory is a prefix of /directory.html. If you want only the contents of the directory folder to be excluded, you should specify /directory/ instead. That last / prevents /directory.html from matching. Note also that the leading / is always necessary on exclusions. The following would be invalid:
Disallow: directory
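The prefix behavior described above can be sketched in a few lines of Python. This is only an illustration: the helper name is_disallowed and the sample paths are assumptions made for the example, not any crawler's actual implementation.

```python
def is_disallowed(path, disallow_rules):
    """Hypothetical helper: True if any Disallow value is a prefix of path."""
    # An empty Disallow value excludes nothing, so empty rules are skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# "/directory" is a prefix of both the file and the folder contents.
print(is_disallowed("/directory.html", ["/directory"]))
print(is_disallowed("/directory/page.html", ["/directory"]))

# The trailing slash narrows the rule to the folder contents only.
print(is_disallowed("/directory.html", ["/directory/"]))
```

In this sketch, a rule written without the leading / can never be a prefix of a URL path, since every path begins with /.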
The * we used for User-agent doesn’t function as a wildcard “glob” operator. Not that it would be useful for anything, but goo*bot would not match googlebot, and is invalid.
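As a rough illustration of this point, the following Python sketch models how a User-agent value might be compared against a robot's name. It is a simplified model under stated assumptions: the helper name record_applies is hypothetical, and the case-insensitive exact comparison is an assumption for the example; real crawlers may match names differently.

```python
def record_applies(record_agent, robot_name):
    """Hypothetical helper: does a User-agent record apply to this robot?"""
    if record_agent == "*":
        # A lone asterisk is a special value meaning "every robot".
        return True
    # Any other value is compared literally; a "*" inside it is not expanded.
    return record_agent.lower() == robot_name.lower()


print(record_applies("*", "googlebot"))
print(record_applies("googlebot", "Googlebot"))
print(record_applies("goo*bot", "googlebot"))
```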
Wildcard “glob” operators are not officially valid in the Disallow: directive either, but Google, MSN, and more recently Yahoo! support this non-standard form of wildcard matching. We generally do not recommend its use, however, both because it is not part of the standard and because various other search engines do not support it.
For information regarding the implementations of wildcard matching from search engine vendors, read:
• Google: http://www.google.com/support/webmasters/bin/answer.py?answer=35303
• MSN: http://search.msn.com.sg/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm#b
• Yahoo!: http://www.ysearchblog.com/archives/000372.html
Using wildcards, the following robots.txt file would tell Google not to index any URL containing the substring print= anywhere within the URL:
User-agent: googlebot
Disallow: /*print=
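To make the effect of the * concrete, here is a minimal Python sketch of how such a rule could be evaluated. It assumes that * stands for any run of characters and that the rule is still anchored at the start of the URL path; the helper name wildcard_rule_to_regex and the sample URLs are hypothetical, and each search engine's real implementation may differ.

```python
import re


def wildcard_rule_to_regex(rule):
    """Hypothetical helper: compile a Disallow value containing * into a regex."""
    # Escape each literal piece, then let every * match any run of characters.
    pieces = [re.escape(piece) for piece in rule.split("*")]
    # Only the start is anchored, so the rule still behaves as a prefix match.
    return re.compile("^" + ".*".join(pieces))


rule = wildcard_rule_to_regex("/*print=")

# Excluded: print= appears somewhere after the leading slash.
print(bool(rule.match("/article.php?id=5&print=1")))

# Not excluded: the substring print= is absent.
print(bool(rule.match("/article.php?id=5")))
```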
It may seem counterintuitive and rather annoying that there is no Allow directive to complement Disallow. Certain search engines (Google and Yahoo! included) do indeed permit its use, but nuances of their interpretations may vary, and it is not part of the standard. We strongly recommend not using this directive.
To elaborate, a string specified after Disallow: is equivalent to the regular expression ^<your string>.*$, which means that it matches anything that begins with that string.
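The equivalence can be checked with a short Python sketch. The function name disallow_to_regex is an assumption for the example. One detail the plain-text formula glosses over is that characters such as . or ? in the string must be escaped before it is placed in a regular expression, which re.escape handles here.

```python
import re


def disallow_to_regex(disallow_value):
    """Hypothetical helper: build ^<your string>.*$ from a Disallow value."""
    # re.escape keeps characters such as "." and "?" literal in the pattern.
    return re.compile("^" + re.escape(disallow_value) + ".*$")


pattern = disallow_to_regex("/directory")

# The regex and a plain prefix test agree on every sample path.
for path in ["/directory.html", "/directory/page.html", "/other/directory"]:
    print(path, bool(pattern.match(path)), path.startswith("/directory"))
```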
If you must use wildcards in the Disallow clause, it is wise to do so only under a specific user-agent clause; for example, User-agent: googlebot.
Chapter 5: Duplicate Content