I don’t worry about web crawlers on this site - there are quite a few that scrape it but they do it in a way that has very little impact on my server, maybe a couple of hundred requests a day max.
A while ago I decided to host my git repos myself as a way of getting away from Github (I also have a Codeberg account but I haven’t started using it yet). The git site has a lot of pages, one per commit per file, dynamically generated by cgit when requested. Each page has links to many others.
As you may imagine, this is the kind of thing that bots seem drawn to - so many links to follow! It has a robots.txt that tells them not to do this but that doesn’t stop the worst scrapers.
I don’t check my logs regularly, so it was months before I noticed that requests were being sent to the site around once per second, increasing my normally very low processor use. Assuming this is as nonstop as it seems, that is over 80,000 requests per day. Each request comes from a new, unrelated IP address, making it very hard to block. This means they are using either a botnet or a paid service that routes requests through ordinary people’s connections.
New defences
Anyone can access some pages, such as the repo page and the current state of files. When requesting a page from a previous commit for the first time, you get served a static page (with a 403 Forbidden status code) that sets a cookie - human set to yes - and a Refresh header to reload the page.
If the human:yes cookie is sent with a request for a restricted page, the page will then be generated by the cgit back-end and served. The bots continue to request files, but now it doesn’t impact the server nearly as much, as they all get sent the static page.
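The scheme above can be sketched as a tiny WSGI front-end sitting in front of cgit. This is a minimal illustration, not the actual implementation: forward_to_cgit is a hypothetical stand-in for whatever proxies to the real back-end, and the real version would only gate the restricted per-commit pages rather than everything.

```python
# Minimal sketch of the cookie gate described above. Assumptions:
# forward_to_cgit is a hypothetical placeholder for the real cgit back-end.
from http.cookies import SimpleCookie

CHALLENGE_PAGE = b"<html><body>Reloading...</body></html>"


def forward_to_cgit(environ, start_response):
    # Placeholder: in reality this would invoke or proxy to cgit.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<!-- dynamically generated cgit page -->"]


def app(environ, start_response):
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    morsel = cookies.get("human")
    if morsel is not None and morsel.value == "yes":
        # Cookie present: let the request through to the back-end.
        return forward_to_cgit(environ, start_response)
    # First visit: cheap static 403 that sets the cookie and tells
    # the browser to reload the same URL immediately. Browsers honour
    # the Refresh header and resend with the cookie; dumb scrapers
    # just keep receiving this static page.
    start_response("403 Forbidden", [
        ("Content-Type", "text/html"),
        ("Set-Cookie", "human=yes; Path=/"),
        ("Refresh", "0"),
    ])
    return [CHALLENGE_PAGE]
```

The key property is that the expensive cgit page generation only happens on the second, cookie-bearing request, which most scrapers never make.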
If you can’t access a page this way, you can git clone the repo instead.
On a possibly unrelated note, this site is no longer accessible over Tor because it was using so much processor time, maybe also because of scraping.