Discussion:
Preventing robot indexing attacks
Ivan Shmakov
2017-07-15 16:24:44 UTC
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]
I have a website organized as a large number (> 200,000) of pages.
It is hosted by a large Internet hosting company.
Many websites provide much more information than mine by computing
it on the fly with server scripts, but I have, in effect, all the
query results pre-computed. I spend a few gigabytes on the data,
but that's almost nothing these days, and I don't waste the server's
time on scripts.
My users may click to 10 or 20 pages in a session. But the indexing
bots want to read all 200,000+ pages! My host has now complained
that the site is under "bot attack" and has asked me to check my own
laptop for viruses!
I'm happy anyway to reduce the bot activity. I don't mind having my
site indexed, but once or twice a year would be enough!
I see that there is a way to stop the Google Bot specifically. I'd
love it if I could do the opposite -- have *only* Google index my
site.
JFTR, I personally (as well as many other users who value their
privacy) refrain from using Google Search and rely on, say,
https://duckduckgo.com/ instead.
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked the
46.229.168.* IP range to prevent the further abuse and advice you to
also check incoming traffic and block such IP's in future.
We have also blocked the bots by adding the following entry
in robots.txt:-
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: Linguee Bot
Disallow: /
I believe that solutions like the above will only lead to
your site fading into obscurity for the majority of Web users --
by means of being removed from Web search results.

As long as the troublesome bots honor robots.txt (there are those
that do not; but then, the above won't work on them, either),
a saner solution would be to limit the /rate/ at which the bots
request your pages for indexing, like:

### robots.txt

### Data:

## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3

### robots.txt ends here

This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope)
will be well within "acceptable use limits" of your hosting
company.
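
(For a rough check: 2e5 pages at one request every 3 seconds is about
6e5 seconds, i. e. just under seven days per crawler -- assuming each
crawler honors the delay on its own schedule.)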

[...]
--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A
Eli the Bearded
2017-07-15 21:04:08 UTC
Post by Ivan Shmakov
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]
:^)
Post by Ivan Shmakov
I have a website organized as a large number (> 200,000) of pages.
It is hosted by a large Internet hosting company.
...
Post by Ivan Shmakov
My users may click to 10 or 20 pages in a session. But the indexing
bots want to read all 200,000+ pages! My host has now complained
that the site is under "bot attack" and has asked me to check my own
laptop for viruses!
200k pages isn't that huge, and if they are static files on disk, as
described in a snipped-out part, they shouldn't be that hard to serve.
Bandwidth may be an issue, depending on how you are being charged. And
on a shared system, which I think you might have, your options for
optimizing for massive amounts of static files might be limited.
Post by Ivan Shmakov
I'm happy anyway to reduce the bot activity. I don't mind having my
site indexed, but once or twice a year would be enough!
Some of the better search engines will gladly consult site map files
that give hints about what needs reindexing. See:

https://www.sitemaps.org/protocol.html
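
A minimal sitemap, following the sitemaps.org protocol, looks roughly
like this (example.com, the file name, and the "yearly" change
frequency are placeholders, not something taken from your site):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <!-- one <url> entry per page -->
      <loc>http://www.example.com/page-00001.html</loc>
      <lastmod>2017-01-15</lastmod>
      <changefreq>yearly</changefreq>
    </url>
  </urlset>

A "Sitemap: http://www.example.com/sitemap.xml" line in robots.txt
tells crawlers where to find the file. Note that the protocol caps a
single sitemap at 50,000 URLs, so a 200,000+-page site needs a sitemap
index pointing at several of them.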
Post by Ivan Shmakov
I see that there is a way to stop the Google Bot specifically. I'd
love it if I could do the opposite -- have *only* Google index my
site.
JFTR, I personally (as well as many other users who value their
privacy) refrain from using Google Search and rely on, say,
https://duckduckgo.com/ instead.
Yeah, Google only is an "all your eggs in one basket" route. I, too,
have been using DDG almost exclusively for several years.
Post by Ivan Shmakov
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked the
46.229.168.* IP range to prevent the further abuse and advice you to
also check incoming traffic and block such IP's in future.
46.229.168.0-46.229.168.255 is:

netname: ADVANCEDHOSTERS-NET

Can't say I've heard of them.
Post by Ivan Shmakov
We have also blocked the bots by adding the following entry
in robots.txt:-
User-agent: AhrefsBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://ahrefs.com/robot
Post by Ivan Shmakov
User-agent: MJ12bot
Eh, maybe block, maybe not. Seems to be a real search engine.
http://mj12bot.com/
Post by Ivan Shmakov
User-agent: SemrushBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://www.semrush.com/bot/
Post by Ivan Shmakov
User-agent: YandexBot
Real Russian search engine.
https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml
Post by Ivan Shmakov
User-agent: Linguee Bot
Real service, but dubious value to a webmaster.
http://www.botreports.com/user-agent/linguee-bot.shtml

All bots can be impersonated by other bots, so you can't be sure the
User-Agent: will be the real identity of the bots. You can spend a lot
of time researching bots and the characteristics of real bot usage, e.g.
hostnames or IP address ranges of legit bot servers.
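
The usual check for the big engines (the address below is the one from
Google's own documentation, not from your logs) is a reverse lookup
followed by a forward lookup of the resulting name:

  $ host 66.249.66.1
  1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
  $ host crawl-66-249-66-1.googlebot.com
  crawl-66-249-66-1.googlebot.com has address 66.249.66.1

If the reverse name isn't under the engine's published domain
(googlebot.com, search.msn.com, and so on), or doesn't resolve back to
the same address, the "bot" is an impostor.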

Given the little I've seen here, I wonder if you have someone at
Advanced Hosters impersonating bots to suck your site down.
Post by Ivan Shmakov
As long as the troublesome bots honor robots.txt (there are those
that do not; but then, the above won't work on them, either),
a saner solution would be to limit the /rate/ at which the bots
### robots.txt
## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3
### robots.txt ends here
Except for Linguee, I think all of the bots listed above are
well-behaved and will obey robots.txt, but I don't know if they are all
advanced enough to know Crawl-delay. Some of them explicitly state they
do, however.
Post by Ivan Shmakov
This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope)
will be well within "acceptable use limits" of your hosting
company.
The only bot I've ever had to blacklist was an MSN bot that absolutely
refused to stop hitting one page over and over again a few years ago. I
used a server directive to shunt that one bot to 403 Forbidden errors.
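
In Apache, for example, a short mod_rewrite stanza does the job (the
user-agent pattern here is just a placeholder, not the actual bot
string involved):

  <IfModule mod_rewrite.c>
      RewriteEngine on
      # Answer matching bots with 403 Forbidden instead of content.
      RewriteCond %{HTTP_USER_AGENT} msnbot [nocase]
      RewriteRule .* - [forbidden]
  </IfModule>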

Elijah
------
stopped worrying about bots a long time ago
Ivan Shmakov
2017-07-16 08:00:56 UTC
[...]
Post by Eli the Bearded
Post by Ivan Shmakov
I'm happy anyway to reduce the bot activity. I don't mind having
my site indexed, but once or twice a year would be enough!
Some of the better search engines will gladly consult site map files
https://www.sitemaps.org/protocol.html
... Learning sitemaps has been on my to-do list for a while now...

[...]
Post by Eli the Bearded
Post by Ivan Shmakov
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked
the 46.229.168.* IP range to prevent the further abuse and advice
you to also check incoming traffic and block such IP's in future.
netname: ADVANCEDHOSTERS-NET
Can't say I've heard of them.
Same here.

[...]
Post by Eli the Bearded
All bots can be impersonated by other bots, so you can't be sure the
User-Agent: will be the real identity of the bots.
True in general, but of little relevance in the context of
robots.txt. For one thing, a misbehaving robot may very well
have one string for User-Agent:, yet look for something entirely
different in robots.txt (if it even decides to honor the file.)
Post by Eli the Bearded
You can spend a lot of time researching bots and the characteristics
of real bot usage, eg hostnames or IP address ranges of legit bot
servers.
[...]
Post by Eli the Bearded
Except for Linguee, I think all of the bots listed above are
well-behaved and will obey robots.txt,
FWIW, Linguee claim "[they] want [their] crawler to be as polite
as possible." (http://linguee.com/bot.)
Post by Eli the Bearded
but I don't know if they are all advanced enough to know Crawl-delay.
Some of them explicitly state they do, however.
Not that I've watched closely, but I don't recall stumbling upon
a robot that would honor robots.txt yet issue requests in
quick succession contrary to Crawl-delay:. That might've been
because of bot-side rate limits, of course.
Post by Eli the Bearded
Post by Ivan Shmakov
This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope) will be
well within "acceptable use limits" of your hosting company.
The only bot I've ever had to blacklist was an MSN bot that absolutely
refused to stop hitting one page over and over again a few years ago.
I used a server directive to shunt that one bot to 403 Forbidden
errors.
There seem to be a few misbehaving robots that frequent my
servers; most masquerade as browsers -- and, of course, never
consult robots.txt.

For instance, 37.59.55.128, 109.201.142.109, 188.165.233.228,
and I recall some such activity from Baidu networks. (Alongside
their "regular", well-behaved crawler.)
Post by Eli the Bearded
Elijah ------ stopped worrying about bots a long time ago
How so?
--
FSF associate member #7257 np. Into The Dark -- Radiarc 3013 B6A0 230E 334A
Eli the Bearded
2017-07-16 20:27:40 UTC
Post by Eli the Bearded
Elijah ------ stopped worrying about bots a long time ago
A combination of enough capacity to not care about load and no longer
using web advertising, removing the need to audit logs.

Elijah
------
understands others have other needs and priorities
Doc O'Leary
2017-07-16 14:50:46 UTC
For your reference, records indicate that
Post by Ivan Shmakov
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]
Assuming we’re not missing any info as a result . . .
Post by Ivan Shmakov
My users may click to 10 or 20 pages in a session. But the indexing
bots want to read all 200,000+ pages! My host has now complained
that the site is under "bot attack" and has asked me to check my own
laptop for viruses!
This doesn’t make much sense. The web host sounds incompetent, so I
don’t know that we can trust what is being reported by them. Getting
(legitimately) spidered is not an attack. Any attack you *may* be
under would not be as a result of a virus on your own non-server
computer. I’d find a different hosting provider.
Post by Ivan Shmakov
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked the
46.229.168.* IP range to prevent the further abuse and advice you to
also check incoming traffic and block such IP's in future.
There is nothing about the 46.229.160.0/20 range in question that
indicates it represents a legitimate bot. Do the logs actually
indicate vanilla spidering, or something more nefarious like looking
for PHP/WordPress exploits? I see a lot of traffic like that.

In such cases, editing robots.txt is unlikely to solve the problem.
Generally, I don’t even bother configuring the web server to deny a
serious attacker. I’d just drop their whole range into my firewall,
because odds are good that a dedicated attacker isn’t going to only
go after port 80.
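
With iptables, for instance, that's a one-liner (the range shown is
just the one from this thread; substitute whatever your logs actually
implicate):

  # Drop all traffic from the offending range, not just port 80.
  iptables -I INPUT -s 46.229.160.0/20 -j DROP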

That might be beyond the scope of what a basic web hosting company
provides but, really, given that a $15/year VPS can handle most
traffic for even a 200K page static site with ease, I really can’t
imagine what the real issue is here. More details needed.
--
"Also . . . I can kill you with my brain."
River Tam, Trash, Firefly
Ivan Shmakov
2017-07-19 12:12:36 UTC
Post by Doc O'Leary
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked
the 46.229.168.* IP range to prevent the further abuse and advice
you to also check incoming traffic and block such IP's in future.
There is nothing about the 46.229.160.0/20 range in question that
indicates it represents a legitimate bot. Do the logs actually
indicate vanilla spidering, or something more nefarious like looking
for PHP/WordPress exploits? I see a lot of traffic like that.
Same here.
Post by Doc O'Leary
In such cases, editing robots.txt is unlikely to solve the problem.
Yes. (Although blocking the range is likely to.)
Post by Doc O'Leary
Generally, I don’t even bother configuring the web server to deny a
serious attacker. I’d just drop their whole range into my firewall,
because odds are good that a dedicated attacker isn’t going to only
go after port 80.
Personally, I configured my Web server to redirect requests like
that to localhost:discard, and let the scanners disconnect at
their own timeouts. (5-15 s, from the looks of it.) Like:

<IfModule mod_rewrite.c>
    # Proxy requests for typical exploit-scanner paths to the local
    # "discard" service (TCP port 9); the [P] flag needs mod_proxy.
    RewriteCond %{REQUEST_URI} \
        ^/(old|sql(ite)?|wp|XXX|[-/])*(admin|manager|YYY) [nocase]
    RewriteRule .* http://ip6-localhost:9/ [P]
</IfModule>

(That is, the "tar pit" approach. Alternatively, one may use
mod_security, but to me that seemed like overkill.)

When the scanner in question is a simple single-threaded,
"sequential" program, that may also reduce its impact on the
rest of the Web.

OTOH, when I start receiving spam from somewhere, the respective
range has a good chance of ending up in my firewall rules.

(There was one dedicated spammer that I reported repeatedly to
their hosters, only for them to move to some other service. I'm
afraid I grew lazy when they moved to the "Zomro" networks; e. g.,
178.159.42.0/25. They seem to have stayed there for months now.)
Post by Doc O'Leary
That might be beyond the scope of what a basic web hosting company
provides but, really, given that a $15/year VPS can handle most
traffic for even a 200K page static site with ease, I really can’t
imagine what the real issue is here. More details needed.
Now, that's interesting. The VPS services I use (or used) are
generally $5/month or more.

Virpus VPSes were pretty cheap back when I used them, but
somewhat less reliable.
Post by Doc O'Leary
-- "Also . . . I can kill you with my brain." River Tam, Trash,
Firefly
... Also on my to-watch list. (I certainly liked Serenity.)
--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A
Doc O'Leary
2017-07-19 14:55:51 UTC
For your reference, records indicate that
Post by Ivan Shmakov
Post by Doc O'Leary
That might be beyond the scope of what a basic web hosting company
provides but, really, given that a $15/year VPS can handle most
traffic for even a 200K page static site with ease, I really can’t
imagine what the real issue is here. More details needed.
Now, that's interesting. The VPS services I use (or used) are
generally $5/month or more.
Well, I certainly can and do pay more for a VPS when I need more
resources, but when it comes to hosting a static site like we’re
talking about here, you really don’t need much to do it. These days,
the limiting factor is quickly becoming the cost of an IPv4 address.

There’s really no reason I can think of that a basic virtual web host
should be balking over the OP’s site. It’s the kind of thing I’d
host for friends for free because the overhead would seem like a
rounding error.
Post by Ivan Shmakov
... Also in my to-watch list. (I've certainly liked Serenity.)
Personally, I think the series was *much* better than the movie.
--
"Also . . . I can kill you with my brain."
River Tam, Trash, Firefly