Gitea has a built-in defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.
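For anyone wondering where that lives, it is a one-line setting in app.ini; a minimal sketch, assuming a reasonably recent Gitea release where the `expensive` value exists:

```ini
; app.ini (sketch): require sign-in only for the "expensive" views
; while keeping normal repository browsing open to anonymous visitors.
; Check your Gitea version's docs for exactly which routes this covers.
[service]
REQUIRE_SIGNIN_VIEW = expensive
```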
If you don't need global access, I have found that geoblocking is the best first step, especially if you are in a small country with a small footprint and can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
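For what it's worth, the geoblocking approach above can be a handful of nftables rules plus a published per-country CIDR list; a rough sketch, assuming a list provider such as ipdeny.com (country codes and ports are placeholders):

```sh
# Sketch: drop HTTP(S) traffic from a set of country CIDR ranges.
nft add table inet geoblock
nft add set inet geoblock blocked4 '{ type ipv4_addr; flags interval; }'
nft add chain inet geoblock input '{ type filter hook input priority -10; policy accept; }'
nft add rule inet geoblock input ip saddr @blocked4 tcp dport '{ 80, 443 }' drop

# Load one country's published ranges into the set (repeat per country).
cidrs=$(curl -s https://www.ipdeny.com/ipblocks/data/countries/ru.zone | paste -sd, -)
nft add element inet geoblock blocked4 "{ $cidrs }"
```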
> VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
This will in part be people on home connections tinkering with LLMs, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing or updating a browser add-in or “free” VPN client that has become, or always was, nefarious) and whose home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
I'm not 100% against AI, but I do cheer loudly when I see things like this!
I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
I wonder if this is going to push more and more services to be hidden from the public internet.
My personal services are only accessible from my own LAN or via a VPN.
If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all.
I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a roughly daily basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
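Something like this on their side would be all it takes (a sketch; the URL is a placeholder):

```sh
# One-time: a bare mirror of everything, all branches and tags included.
git clone --mirror https://git.example.org/someone/project.git project.git

# Daily cron: only new objects cross the wire.
git -C project.git fetch --prune

# Blame, log, diffs etc. can then be computed locally, off the forge:
git -C project.git log --all --oneline
```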
On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.
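For anyone wanting to replicate that split, it looks roughly like this in nginx with oauth2-proxy running in auth_request mode. This is only a sketch: upstream names, ports and hostname are placeholders, and TLS plus the usual proxy headers are omitted for brevity.

```nginx
upstream forge        { server 127.0.0.1:3000; }   # Gitea/Forgejo/etc.
upstream oauth2_proxy { server 127.0.0.1:4180; }

server {
    listen 443 ssl;
    server_name git.example.org;

    # git clone/fetch/push over smart HTTP stays open, so CI keeps working.
    location ~ ^/.+/(info/refs|git-upload-pack|git-receive-pack)$ {
        proxy_pass http://forge;
    }

    # oauth2-proxy's own endpoints (sign-in, callback, etc.).
    location /oauth2/ {
        proxy_pass http://oauth2_proxy;
        proxy_set_header X-Forwarded-Uri $request_uri;
    }

    location = /oauth2/auth {
        internal;
        proxy_pass http://oauth2_proxy;
        proxy_set_header Content-Length "";
        proxy_pass_request_body off;
    }

    # Everything else, i.e. the expensive web UI, requires a login.
    location / {
        auth_request /oauth2/auth;
        error_page 401 = /oauth2/sign_in;
        proxy_pass http://forge;
    }
}
```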
Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0, even on my silliest of throw-away sites. I agree their caching method is good and should be used when much of the content is cacheable. Blocking specific IPs is a never-ending game of whack-a-mole. I do block some data-center ASNs, as I do not expect real people to come from them, even though they could; it's an acceptable trade-off for my junk. There are many things people can learn from capturing TCP SYN packets for a day and comparing them to access logs while sorting out bots vs. legit people. There are quite a few headers that a browser will send that most bots do not. Many bots also fail to send a valid TCP MSS and TCP window.
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require clients to say they accept English, optionally in addition to other languages, but not a couple of combinations that are blocked; e.g. en,de-DE,de is good, while de-DE,de will fail, just because. Not suggesting anyone do this.

[1] - https://mirror.newsdump.org/bot_test.txt
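For the curious, both of those checks (requiring HTTP/2.0, and the Accept-Language rule, which again nobody is recommending) can be sketched in a few lines of nginx. Hostname and paths below are placeholders; nginx reports h2 requests as "HTTP/2.0" in $server_protocol, and the `http2 on;` form needs a recent nginx (older versions use `listen 443 ssl http2;`):

```nginx
# Require that the client claims English somewhere in Accept-Language;
# explicitly blocked combinations go first, since the first matching
# regex wins.
map $http_accept_language $lang_ok {
    default        0;
    "~*^de-DE,de$" 0;   # an explicitly blocked combination
    "~*\ben\b"     1;   # any list mentioning English passes
}

server {
    listen 443 ssl;
    http2 on;
    server_name mirror.example.org;

    # Turn away anything that negotiated an older HTTP version.
    if ($server_protocol != "HTTP/2.0") {
        return 403;
    }

    if ($lang_ok = 0) {
        return 403;
    }

    location / {
        root /var/www/mirror;
    }
}
```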
I, too, am self-hosting some projects on an old computer. And the fact that you can "hear the internet" (the fans spinning up) is really cool (unless you're trying to sleep while being scraped).
"Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."
I do wonder, though. Content scrapers that truly value data would stand to benefit from deploying heuristics that prize being as efficient as possible in terms of information per query. Wastefulness of the described type loads not just your servers but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near zero, as they have zero interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.
This is a good illustration of why letting websites have direct access to git is not a great idea. I started creating static versions of my projects, with great success: https://git.erock.io
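For anyone who wants to do the same without that particular stack, a static-page generator such as stagit can produce the pages; a rough sketch, with placeholder paths (check stagit's docs for the exact outputs):

```sh
# Sketch: render the repo to plain HTML whenever it changes, then serve
# the output as static files so no dynamic git views face the crawlers.
mkdir -p /var/www/git/project
cd /var/www/git/project
stagit /srv/git/project.git                        # writes log.html, files.html, ...
stagit-index /srv/git/*.git > /var/www/git/index.html
```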
For private instances, you can get down to zero scrapers by firewalling the HTTP/S ports from the internet and using WireGuard. I knew it was time to batten down the hatches when fail2ban became the top process by bytes written in iotop (between SSH login attempts and nginx logs).
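Concretely, that can be as little as this (an nftables sketch; interface names and ports are whatever your setup uses, and test it carefully so you don't lock yourself out of SSH):

```sh
# Sketch: default-deny inbound; only WireGuard is reachable from the WAN,
# and SSH/HTTP(S) are reachable only over the tunnel interface.
nft add table inet vpnonly
nft add chain inet vpnonly input '{ type filter hook input priority 0; policy drop; }'
nft add rule inet vpnonly input ct state established,related accept
nft add rule inet vpnonly input iifname "lo" accept
nft add rule inet vpnonly input udp dport 51820 accept
nft add rule inet vpnonly input iifname "wg0" tcp dport '{ 22, 80, 443 }' accept
```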
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Why the hell don't these bots just do a git clone and analyse the source code locally? Much less impact on the server, and they would be able to perform the same analysis on all repositories, regardless of what a particular git forge offers.
It would be nice if there was a common crawler offering deltas on top of base checkpoints of the entire crawl; I am guessing most AI companies would prefer not having to mess with their own scrapers. Google could probably make a mint selling access.
I was setting up a small system to do web site serving, mostly just experimental, to try out some code: learning how to use nginx as a reverse proxy, and learning how to use dynamic DNS services since my home connection is on a dynamic IP. Early on, I discovered lots of traffic and lots of hard drive activity. The HD activity was from logging. It seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.
In general, the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desired.
Do we want to change that? Do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? Is net neutrality a bad thing after all?
What about a copyright notice on websites stating that anyone using your site for training grants the owner of the site an eternal, non-revocable license to the model, and must provide a copy of the model upon request? At least then there would be SOME benefit.
I had the same problem on our home server. I just stopped the git forge due to lack of time.
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Seems like you're cooking up a solid bot detection solution. I'd recommend adding JA3/JA4+ into the mix; I had good results with it against dumb scrapers.
Also, have you considered CAPTCHAs for first contact / rate limiting?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
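Not quite a public database, but Team Cymru's whois service will at least map an IP to its origin ASN without going through a commercial middleman; a quick sketch (the addresses are placeholders):

```sh
# Look up the origin ASN and AS name for a single address.
whois -h whois.cymru.com " -v 203.0.113.7"

# Bulk lookups: frame a list of IPs with "begin" / "end" and send it to port 43.
printf 'begin\nverbose\n203.0.113.7\n198.51.100.23\nend\n' | nc whois.cymru.com 43
```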
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Could this be solved with an EULA and some language that non-human readers will be billed at $1 per page? Make all users agree to it. They either pay up or they are breaching contract.
Does anyone have an idea how to generate, say, insecure code en masse? I think it should be the next frontier. Not feeding them a random bytestream, but toxic waste.
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
Is this viable?
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.