bad robots doing downloads keep crashing the site

Issue #83 resolved
Craig Berry created an issue

Various robots are not obeying robots.txt and are continually trying to follow the download links. Each ePub or PDF download of a large document grabs a gigabyte or two of memory and hangs onto it for a long time. Eventually the JVM crashes because garbage collection goes compute-bound trying to free memory but cannot reclaim any.

We could try implementing reCAPTCHA, but that involves changing all the ordinary hyperlinks into form submissions and writing server-side code to handle those form submissions. For now, it seems we just need to disable downloads.

Some examples of the bad actors from the Apache access log are below.

54.36.148.173 - - [30/Dec/2018:15:46:50 +0000] "GET /works/A31614.xml?page=018-b HTTP/1.1" 502 3919 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"
186.211.3.38 - - [30/Dec/2018:16:06:53 +0000] "GET / HTTP/1.1" 302 482 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
5.9.71.213 - - [30/Dec/2018:16:06:19 +0000] "GET /works/N16402.plain?token=96097b22-04da-497d-ba75-466b4892ea06&cache=no HTTP/1.1" 500 100243 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
149.202.82.11 - - [30/Dec/2018:16:06:36 +0000] "GET /works/N13451.tei?token=a0913e81-89ae-429d-b0e6-7a3aa60c85a6&cache=no HTTP/1.1" 200 23530 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
144.76.186.38 - - [30/Dec/2018:15:47:31 +0000] "GET /works/N18799.epub?token=95e5424e-1565-4458-957f-5a5081399f92&cache=no HTTP/1.1" 502 4152 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
149.202.82.11 - - [30/Dec/2018:16:07:08 +0000] "GET /works/N13451.tei?token=feb87479-6820-418b-90a1-6802b6d87ec8&cache=no HTTP/1.1" 200 23531 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
149.202.82.11 - - [30/Dec/2018:16:07:14 +0000] "GET /works/N15865.tei?token=e3639f4b-9108-4d4e-91af-8697de45769c&cache=no HTTP/1.1" 200 97207 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
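
To see which crawlers are responsible, entries like the ones above can be tallied by user agent. The following is a minimal sketch, assuming the log uses the Apache "combined" format shown here. The function name `count_bot_downloads`, the list of download extensions, and the "name ending in bot" heuristic are all assumptions made for illustration, not part of the site's code.

```python
import re
from collections import Counter

# Apache "combined" format: host, identity, user, [time], "request",
# status, bytes, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: these extensions are the download formats worth tracking.
DOWNLOAD_PATH = re.compile(r'\.(epub|pdf|plain|tei)(\?|$)')

# Heuristic: the first token in the user agent that ends in "bot" or "Bot".
BOT_NAME = re.compile(r'([A-Za-z0-9]+[Bb]ot)\b')


def count_bot_downloads(lines):
    """Count download requests per crawler name in access-log lines."""
    counts = Counter()
    for line in lines:
        entry = LOG_PATTERN.match(line)
        if entry is None:
            continue  # skip lines that are not in combined format
        bot = BOT_NAME.search(entry.group("agent"))
        if bot and DOWNLOAD_PATH.search(entry.group("path")):
            counts[bot.group(1)] += 1
    return counts
```

Passing an open access-log file to `count_bot_downloads` and calling `most_common()` on the result would list the heaviest offenders first. Crawlers that disguise themselves with a browser user agent are not caught by this heuristic.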

Comments (2)

  1. Craig Berry reporter

    Actually, we were missing a robots.txt that could be served through the proxy. This has now been added. We've also temporarily disabled file downloads, but the plan is to reenable them once we are sure the robots are leaving us alone.

  2. Craig Berry reporter

    The robots.txt seems to have solved the problem. Downloads have been reenabled for about a week with no trouble.
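
A robots.txt rule set can be checked offline before relying on it. The sketch below uses Python's standard `urllib.robotparser` for this. The rules in `ROBOTS_TXT` are hypothetical, since the robots.txt actually deployed is not shown in this issue, and the helper name `is_allowed` is likewise an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real file is not in this issue.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit `agent` to fetch `url`.

    `agent` should be the short crawler token (e.g. "MJ12bot"), not the
    full User-Agent header, because the parser matches on the part of
    the string before the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("MJ12bot", "AhrefsBot", "SomeBrowser"):
        print(agent, is_allowed(agent, "/works/N18799.epub"))
```

This only confirms what a well-behaved crawler would conclude from the file. robots.txt is advisory, so a crawler that ignores it would still need to be blocked at the proxy or server.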
