Ivan Shmakov
2017-07-15 16:24:44 UTC
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]

> I have a website organized as a large number (> 200,000) of pages.
> It is hosted by a large Internet hosting company.
> Many websites provide much more information than mine by computing
> info on-the-fly with server scripts, but I have, in effect, all the
> query results pre-computed. I waste a few gigabytes for the data,
> but that's almost nothing these days, and don't waste the server's
> time on scripts.
>
> My users may click to 10 or 20 pages in a session. But the indexing
> bots want to read all 200,000+ pages! My host has now complained
> that the site is under "bot attack" and has asked me to check my own
> laptop for viruses!
>
> I'm happy anyway to reduce the bot activity. I don't mind having my
> site indexed, but once or twice a year would be enough!
>
> I see that there is a way to stop the Google Bot specifically. I'd
> love it if I could do the opposite -- have *only* Google index my
> site.
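
For reference, the "only Google" arrangement asked about above is
usually expressed in robots.txt by giving Googlebot an empty Disallow
and shutting every other bot out -- a sketch only, assuming the
standard Googlebot User-agent token, and effective solely against
bots that honor robots.txt in the first place:
### robots.txt
## Let Google's crawler in, ask every other bot to stay away.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
### robots.txt ends here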

JFTR, I personally (as well as many other users who value their
privacy) refrain from using Google Search and rely on, say,
https://duckduckgo.com/ instead.

> A technician at the hosting company wrote to me
> As per the above logs and hitting IP addresses, we have blocked the
> 46.229.168.* IP range to prevent the further abuse and advice you to
> also check incoming traffic and block such IP's in future.
>
> We have also blocked the bots by adding the following entry
> in robots.txt:-
> User-agent: AhrefsBot
> Disallow: /
> User-agent: MJ12bot
> Disallow: /
> User-agent: SemrushBot
> Disallow: /
> User-agent: YandexBot
> Disallow: /
> User-agent: Linguee Bot
> Disallow: /
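
As an aside, an address-range block such as 46.229.168.* normally
belongs in the web-server (or firewall) configuration rather than in
robots.txt; with Apache httpd 2.4 and .htaccess overrides enabled --
an assumption, since the hosting setup is not stated -- it might look
something like:
### .htaccess
## Serve everyone except clients in the 46.229.168.0/24 range.
<RequireAll>
    Require all granted
    Require not ip 46.229.168
</RequireAll>
### .htaccess ends here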

I believe that solutions like the above will only lead to
your site fading into obscurity for the majority of Web users --
by means of being removed from Web search results.

As long as the troublesome bots honor robots.txt (there are those
that do not; but then, the above won't work on them, either),
a saner solution would be to limit the /rate/ at which the bots
request your pages for indexing, like:
### robots.txt
### Data:
## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3
### robots.txt ends here

This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope)
will be well within the "acceptable use limits" of your hosting
company.
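
To spell out the arithmetic behind that estimate (taking the
3-second delay at face value, and counting a single bot crawling
the site end to end):
### crawl-estimate.py
## Rough time for one bot to fetch every page while honoring
## a 3-second Crawl-delay.
pages = 200_000                # approximate number of pages on the site
delay = 3                      # seconds between requests (Crawl-delay)
total_seconds = pages * delay  # 600 000 seconds
print(total_seconds / 86_400)  # ~6.9 days, i.e. about a week
### crawl-estimate.py ends here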
[...]
--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A