Discussion:
General robots exclusion becoming fashionable?
Jukka K. Korpela
2006-01-13 18:55:00 UTC
Using the Internet Archive (web.archive.org) and the W3C link checker
(http://validator.w3.org/checklink), I recently noticed that surprisingly many
web servers say "no" to robots, in rather general terms.
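
To make "general terms" concrete: a site owner can tell every compliant robot
to stay away from the entire site with just two lines in robots.txt. (This is
a generic illustration of the Robots Exclusion Protocol, not the actual file
of any site mentioned here.)

  User-agent: *
  Disallow: /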

This means that many links cannot be checked automatically, since the link
checker behaves as a well-behaved robot. In practice, this typically means
that such links are not checked at all. Fighting link rot is frustrating
enough; every new obstacle makes people more likely to give up.
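
(For the curious: "well-behaved" simply means that the checker downloads
/robots.txt first and skips anything the file disallows. Here is a minimal
sketch of that logic in Python - the URLs and the user-agent string are made
up for illustration, not the checker's real ones:

  from urllib.robotparser import RobotFileParser

  # Fetch and parse the site's robots.txt (a missing file means "allow all").
  robots = RobotFileParser()
  robots.set_url("http://www.example.org/robots.txt")
  robots.read()

  page = "http://www.example.org/some/page.html"
  if robots.can_fetch("hypothetical-link-checker", page):
      print("may fetch:", page)
  else:
      print("excluded by robots.txt:", page)

With a blanket "Disallow: /" in place, can_fetch() returns False for every
page, so an automated checker never even looks at the links.)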

The Internet Archive seems to have started _removing_ archived data on various
grounds, including robots exclusion. I'm afraid they are doing this because
they fear copyright infringement. As a result, many pages that were
previously accessible through web.archive.org - which was often useful even
when the pages still existed at their normal addresses but were temporarily
unavailable - now result in an error message that is rather obscure to normal
users, such as:

"Robots.txt Query Exclusion.

We're sorry, access to http://www.finlex.fi/fi/laki/ajantasa/1993/19930886
has been blocked by the site owner via robots.txt.
Read more about robots.txt
See the site's robots.txt file.
Try another request or click here to search for all pages on
finlex.fi/fi/laki/ajantasa/1993/19930886
See the FAQs for more info and help, or contact us."

This has also meant that some links to archived versions stopped working,
because a domain has been bought by a new owner who set up a robots.txt file
- thereby causing _all_ pages that ever existed in the domain to be removed
from the archive. (This is what seems to have happened to the pages at
www.bobbemer.com, which was owned by Bob Bemer, the grand old man of
computing, who died last year.)

Apparently, this is also bad news for users of search engines, i.e. all of us.
If a site has set up a robots.txt file that excludes all robots, then no new
pages from it will be included in search engine databases, and old pages
will probably drop out too.

I wonder how common the problem really is. I guess, and hope, that it has only
just started growing, so we might be able to do something about it by raising
the issue in public.
--
Yucca, http://www.cs.tut.fi/~jkorpela/
Larry__Weiss
2006-01-15 18:22:10 UTC
Post by Jukka K. Korpela
Using the Internet Archive (web.archive.org) and the W3C link checker
(http://validator.w3.org/checklink), I recently noticed that surprisingly many
web servers say "no" to robots, in rather general terms.
...
This has also meant that some links to archived versions stopped working,
because a domain has been bought by a new owner who set up a robots.txt file
- thereby causing _all_ pages that ever existed in the domain to be removed
from the archive. (This is what seems to have happened to the pages at
www.bobbemer.com, which was owned by Bob Bemer, the grand old man of
computing, who died last year.)
I just independently noticed the block on the www.bobbemer.com archive
on the Wayback Machine. Somehow it just doesn't make sense to apply the
new domain owner's preferences to access to archived content from the
prior owner. But I guess it is hard for an archive service to know the
history of ownership.

Hopefully, if and when a new owner lifts the block, the archives will be
available again.

- Larry
