Battling AhrefsBot

I noticed that one of my sites was getting an abnormally high amount of traffic. I had just added some new content, so maybe it got onto Hacker News. Nope. The cache was missing about 40% of the time despite a fairly long expiry time, so nefarious things were afoot.

AhrefsBot was going to town on my server. Or, at least it was something claiming to be this bot. Eventually I’ll put a stop to this, but first, what’s up?

Here’s a typical log line, which explains the cache misses:

- - [30/Jan/2020:12:42:50 -0500] "GET
rss2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible;
AhrefsBot/6.1; +"

First, that percent-encoded string looks like an exploit. What is it? An online decoder helps. It’s a mix of Hangul and Cyrillic:

「충청남도콜걸」↘예약♪출장안마야한곳☼〈카톡: hwp63〉.【птк455.сом】
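If you would rather not paste log data into an online decoder, the same step can be done locally. This is a minimal sketch using Python's standard library. The sample fragment is the short percent-encoded piece visible in the second log line further down in this post, not the full request string.

```python
import unicodedata
from urllib.parse import unquote

# Short percent-encoded fragment copied from the second log line in this post.
encoded = "%EB%B0%94%5B%5D"

# unquote() turns %XX escapes back into bytes and decodes them as UTF-8.
decoded = unquote(encoded)
print(decoded)

# List the Unicode name of each character to see which script it belongs to.
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

The character names make it obvious when a string mixes scripts, which is the tell here.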

Put that into Google Translate:

"Chungcheongnam-do call girl" ↘Reservation ♪ business trip ☼
〈Katok: hwp63〉. 【Птк455.сом】
▼ SG2019-02-26-06-59 [] Motel Trip Massage ShopChungcheongnam-do
♩ Meeting ♩ ♪ [] Motel Trip [] 0dS
┊ [] Call Girl Massage [] Chungcheongnam-do

Chungcheongnam is a province in South Korea.

The Cyrillic address is curious. It looks like .com, but it’s not the Latin alphabet. This is the sort of thing you do to fake someone into going to a site that is in a different top-level domain because the characters look similar. Punycode turns that into http://xn--455-bedys.xn--l1adi. It wasn’t accepting connections.
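To check a lookalike domain like this without a web tool, here is a small sketch that names the letters in the fake "com" label and then converts the whole domain to its ASCII form. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat the output as a cross-check rather than the final word.

```python
import unicodedata

# Domain copied from the decoded string earlier in this post.
domain = "птк455.сом"

# Name each letter of the last label: these are Cyrillic, not Latin.
tld = domain.rsplit(".", 1)[1]
names = [unicodedata.name(ch) for ch in tld]
for ch, name in zip(tld, names):
    print(ch, name)

# The "idna" codec converts every non-ASCII label to its xn-- (Punycode) form.
ascii_domain = domain.encode("idna").decode("ascii")
print(ascii_domain)
```

Any label that comes back with an xn-- prefix contained characters outside ASCII, however Latin it looked on screen.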

Stopping the AhrefsBot

I can’t block this by IP address or subnet because the IP addresses are all over the place. Maybe it’s a botnet. There were about three requests every ten seconds, so not enough of an attack to shut me down. It’s annoying at worst.
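One way to see how scattered the sources are is to group the requests for a given user agent by subnet. The sketch below does that for combined-format log lines. The sample lines, paths, and addresses are hypothetical (the addresses come from the reserved documentation ranges), and the /24 grouping assumes IPv4 clients.

```python
import re
from collections import Counter
from ipaddress import ip_network

# Hypothetical sample lines in combined log format, not taken from the real logs.
SAMPLE_LOG = """\
192.0.2.10 - - [30/Jan/2020:12:42:50 -0500] "GET /search/a/feed/rss2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1)"
198.51.100.7 - - [30/Jan/2020:12:42:53 -0500] "GET /search/b/feed/rss2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1)"
203.0.113.99 - - [30/Jan/2020:12:42:57 -0500] "GET /search/c/feed/rss2/ HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1)"
192.0.2.200 - - [30/Jan/2020:12:43:01 -0500] "GET /about/ HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0"
"""

# First field is the client address; the last quoted field is the User-Agent.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"$')


def subnets_for_agent(log_text, needle):
    """Count requests per /24 for lines whose User-Agent contains needle."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and needle.lower() in m["agent"].lower():
            net = ip_network(f"{m['ip']}/24", strict=False)
            counts[str(net)] += 1
    return counts


result = subnets_for_agent(SAMPLE_LOG, "ahrefsbot")
for net, hits in result.most_common():
    print(net, hits)
```

If the counts spread thinly across many unrelated subnets, as they did for me, blocking by address range is a losing game.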

First, I looked at the link in the user agent and it said I could stop it with an entry in robots.txt. So I made that.

User-agent: AhrefsBot
Disallow: /

They said it could take 100 requests or an hour for them to notice. They did not notice. Fine.
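When a crawler ignores robots.txt, it is worth ruling out a mistake in the file itself. Here is a minimal sketch that feeds the two lines above to Python's standard urllib.robotparser and asks what a compliant crawler would be allowed to fetch. The path and the second bot name are made up for the check.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as the robots.txt entry above.
rules = """\
User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical path and second bot name, used only to exercise the rules.
ahrefs_allowed = parser.can_fetch("AhrefsBot", "/search/anything/")
other_allowed = parser.can_fetch("SomeOtherBot", "/search/anything/")
print("AhrefsBot allowed:", ahrefs_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

This only shows what a well-behaved parser concludes. It says nothing about whether the thing hitting the server bothers to read the file.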

I then decided to block it at the .htaccess level so it would get a 403 response. Maybe that would convince it that my server was worthless and it would stop:

RewriteCond %{HTTP_USER_AGENT} ^.*(AhrefsBot).*$ [NC]
RewriteRule .* - [F,L]
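To sanity-check which user agents that condition catches, the pattern can be tried outside Apache. This sketch uses Python's re module with IGNORECASE standing in for the [NC] flag. Apache uses PCRE rather than Python's engine, but for a pattern this simple I would expect the two to agree. The browser string is a hypothetical example.

```python
import re

# Same pattern as the RewriteCond; re.IGNORECASE plays the role of [NC].
BLOCK_RE = re.compile(r"^.*(AhrefsBot).*$", re.IGNORECASE)


def would_be_forbidden(user_agent):
    """Return True if the RewriteCond pattern matches this User-Agent."""
    return BLOCK_RE.search(user_agent) is not None


# User-Agent as it appears (truncated) in the log line earlier in this post.
bot = "Mozilla/5.0 (compatible; AhrefsBot/6.1; +"
# Hypothetical ordinary browser string.
browser = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

print(would_be_forbidden(bot))
print(would_be_forbidden(browser))
```

The leading `^.*` and trailing `.*$` do no work in an unanchored search, so a bare `AhrefsBot` pattern with [NC] should match the same requests.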

That went on for a couple of hours, and I’ll come back to this later because this had another problem on my side. Next, I blocked them at the Cloudflare level with a User-Agent-based firewall rule. I should have started with this:

Cloudflare firewall rule

Now none of these requests reached my server, but I could watch them in the Cloudflare logs. The crawling went on for another hour or so before it started to back off. Then it switched IP blocks for a couple of checks, switched back to the original block, then switched to another, new block. Once it figured out that I was blocking it, it stopped trying that hard. I’d see a couple of requests an hour.

It goes on

Even though I’d stopped “AhrefsBot”, I was still getting similar traffic from other agents (mostly coming out of the Dutch provider AS39572):

- - [30/Jan/2020:12:52:17 -0500] "GET /search/
%EB%B0%94%5B%5D/feed/rss2/ HTTP/1.1" 200 1990 "-" "Mozilla/5.0
(Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36
(compatible; Googlebot/2.1; +"

It’s the same structure with some different characters:

인터넷바카라사이트2019-03-04-13-22인터넷바둑이게임온라인카지노[]국내 카지노
현황[]온라인 카지노 합법바카라 필승법홀덤바[]

It’s another Korean advertisement:

﹝ Baccarat site ﹞ ✍-Coin Casino-Zu Ruby Go ◆ () ♐
Online Baccarat Sites 2019-03-04-13-22
Status [] Online Casino Legal Baccarat Winning Law Hold'emba []

Perhaps this isn’t AhrefsBot but someone just hijacking its name. Looking at the Cloudflare logs, I see that every IP that I’m blocking with my User-Agent rule is a French IP address. Every one of them. Yep, there are two providers in France that tolerate this sort of nonsense, and it was AS16276 bothering me. These are French IP addresses claiming to be from a company located in Singapore with offices in France and with “roots” in Ukraine (a country that I already block outright). (See also Fighting referral spam, about some of the same French ISPs.)
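A User-Agent header is just a claim, so a borrowed name like the Googlebot one above can be tested. Google documents a reverse-then-forward DNS check for its crawler, and the sketch below follows that idea. The trusted suffixes are my assumption of the relevant domains, the function names are made up for this example, and the resolver functions are passed in so the logic can be tried without touching the network.

```python
import socket

# Assumption: genuine Googlebot addresses reverse-resolve into these domains.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host):
    """Forward lookup: hostname to a list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse=reverse_lookup, forward=forward_lookup):
    """True only if the PTR name is in a trusted domain and maps back to ip."""
    try:
        host = reverse(ip)
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.lower().rstrip(".").endswith(suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False


# Offline demonstration with stand-in resolvers and a documentation address.
print(is_verified_crawler(
    "192.0.2.1",
    reverse=lambda ip: "host.example.net",
    forward=lambda host: ["192.0.2.1"],
))
```

The forward lookup matters because anyone who controls reverse DNS for their own address block can make the PTR record say whatever they like.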

I fell back to another RewriteRule. Since that URL doesn’t map to anything I have, I can do it with just the front part of the URL. This didn’t seem to work at first. I tracked down and disabled an ErrorDocument handler (extra/httpd-multilang-errordoc.conf) to get the right response code sent to the client. For some stupid reason, sending the ErrorDocument version changed the status back to 200 (because that file was found). This might be why my earlier rewrite didn’t work, but I hadn’t turned on logging then. Here’s the rule:

# No substitution ("-"); answer anything under /search/ with a 500
RewriteRule ^/?search/ - [R=500]

Ten years ago this sort of thing could bring down a WordPress server because it would overwhelm MySQL. Now, I press a button at Cloudflare and it’s stopped instantly.