Guarding My Git Forge Against AI Scrapers

(vulpinecitrus.info)

Comments

mappu 12 December 2025
Gitea has a built-in defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.
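
For anyone looking for the knob, it lives in `app.ini` under `[service]`; the `expensive` value is relatively recent, so check the docs for your version to see exactly which routes it gates:

```ini
; app.ini
[service]
; "true" would hide the whole instance behind a login;
; "expensive" gates only the costly views (blame, archives, etc.)
; while keeping normal browsing public
REQUIRE_SIGNIN_VIEW = expensive
```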
FabCH 12 December 2025
If you don't need global access, I have found that geoblocking is the best first step, especially if you are in a small country with a small footprint and can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.

In the article, quite a few of the listed sources of traffic would simply be unable to access the server if the author could get away with a geoblock.
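
A rough sketch of the cheap way to do it, assuming you have already downloaded a per-country CIDR zone file (the file path and set name here are made up):

```sh
# load the country's networks into an ipset and drop them
# before the web server ever sees them
ipset create geoblock hash:net -exist
while read -r net; do
  ipset add geoblock "$net" -exist
done < /etc/geoblock/blocked-country.zone

iptables -I INPUT -p tcp -m multiport --dports 80,443 \
  -m set --match-set geoblock src -j DROP
```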

kstrauser 12 December 2025
Anubis cut accesses to my little personal Forgejo instance, with nothing particularly interesting on it, from about 600K hits per day to about 1000.

That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.

dspillett 12 December 2025
> VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.

This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing or updating a browser add-on or "free" VPN client that has become, or always was, nefarious) and whose home connection is being farmed out by VPN providers selling "domestic IP" services that people running scrapers are buying.

dirkc 12 December 2025
I'm not 100% against AI, but I do cheer loudly when I see things like this!

I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?

wrxd 12 December 2025
I wonder if this is going to push more and more services to be hidden from the public internet.

My personal services are only accessible from my own LAN or via a VPN. If I wanted to share them with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grew I would put everything behind a login-wall.

This of course doesn't cover services I genuinely want to expose to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all.

hashar 12 December 2025
I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
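
The polite version is two commands (the URL is just an example):

```sh
# one-time: a bare mirror pulls every branch, tag and the full history in one go
git clone --mirror https://git.example.org/some/repo.git

# daily: fetch only what changed since the last run
cd repo.git && git remote update --prune
```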
craftkiller 12 December 2025
On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.
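
In case anyone wants to copy the idea, the split looks roughly like this in nginx; ports, paths and the repo prefix are placeholders rather than my exact config, and the full header plumbing for oauth2-proxy is in its docs:

```nginx
# git smart-HTTP endpoints stay open so clones and CI keep working
location ~ ^/mirrors/.+/(info/refs|git-upload-pack)$ {
    proxy_pass http://127.0.0.1:3000;            # the forge itself
}

# every other page under the big mirrors needs a session via oauth2-proxy
location /mirrors/ {
    auth_request /oauth2/auth;
    error_page 401 = /oauth2/sign_in;
    proxy_pass http://127.0.0.1:3000;
}

location /oauth2/ {
    proxy_pass http://127.0.0.1:4180;            # oauth2-proxy
}
location = /oauth2/auth {
    proxy_pass http://127.0.0.1:4180;
    proxy_pass_request_body off;                 # the auth subrequest needs no body
    proxy_set_header Content-Length "";
}
```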
Bender 12 December 2025
Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0 even on my silliest of throw-away sites. I agree their caching method is good and should be used when much of the content is cacheable. Blocking specific IPs is a never-ending game of whack-a-mole. I do block some data-center ASNs, as I do not expect real people to come from them even though they could; it's an acceptable trade-off for my junk. There are many things people can learn from capturing TCP SYN packets for a day and comparing them against access logs while sorting out bots vs legit people. There are quite a few headers that a browser will send that most bots do not, and many bots also fail to send a valid TCP MSS and TCP window.
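
The HTTP/2 gate itself can be a one-liner; one way to do it in nginx:

```nginx
# inside the server block: anything that didn't negotiate HTTP/2 gets dropped
# without a response (444 is nginx's "close the connection" status)
if ($server_protocol != "HTTP/2.0") {
    return 444;
}
```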

Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.

[Edit] - I should add that I require bots to tell me they speak English, optionally in addition to other languages, but a couple of combinations are blocked: e.g. `en,de-DE,de` is good, `de-DE,de` will fail, just because. Not suggesting anyone do this.

[1] - https://mirror.newsdump.org/bot_test.txt

sodimel 12 December 2025
I, too, am self-hosting some projects on an old computer. And the fact that you can "hear the internet" (with the fans kicking in) is really cool (unless you're trying to sleep while being scraped).
PeterStuer 12 December 2025
"Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."

I do wonder, though. Content scrapers that truly value data would stand to benefit from deploying heuristics that maximize information per query. Wastefulness of the described kind loads not just your servers, but also their whole processing pipeline on their end.

But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near zero, as they have zero interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.

qudat 12 December 2025
This is a great reason why letting websites have direct access to git is not a great idea. I started creating static versions of my projects with great success: https://git.erock.io
GoblinSlayer 12 December 2025
>Iocaine has served 38.16GB of garbage

And what is the effect?

I opened https://iocaine.madhouse-project.org/ and it served me the generated maze, thinking I'm an AI :)

>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.

overfeed 19 hours ago
For private instances, you can get down to zero scrapers by firewalling the HTTP/S ports from the internet and using WireGuard. I knew it was time to batten down the hatches when fail2ban became the top process by bytes written in iotop (between ssh login attempts and nginx logs).
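
The ruleset for that ends up tiny; a minimal nftables sketch, where the interface names and the WireGuard port are just examples:

```
# /etc/nftables.conf (fragment)
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iifname "lo" accept
        udp dport 51820 accept                       # WireGuard from anywhere
        iifname "wg0" tcp dport { 80, 443 } accept   # web UI only over the VPN
        tcp dport 22 accept                          # or move ssh behind wg too
    }
}
```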

The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.

Artoooooor 15 hours ago
Why the hell don't these bots just do a git clone and analyse the source code locally? Much less impact on the server, and they would be able to perform the same analysis on all repositories, regardless of what a particular git forge offers.
benlivengood 22 hours ago
It would be nice if there was a common crawler offering deltas on top of base checkpoints of the entire crawl; I am guessing most AI companies would prefer not having to mess with their own scrapers. Google could probably make a mint selling access.
zoobab 12 December 2025
Use stagit: static pages served by a simple nginx are blazing fast and should withstand any scraper.
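
For anyone who hasn't tried it, generating a repo's pages is roughly this (the paths are just examples):

```sh
# build static HTML for one repo into the web root, then rebuild the index
mkdir -p /var/www/git/myrepo && cd /var/www/git/myrepo
stagit /srv/git/myrepo.git

cd /var/www/git
stagit-index /srv/git/*.git > index.html
```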
stevetron 12 December 2025
I was setting up a small system to do web site serving, mostly just experimental, to try out some code: learning how to use nginx as a reverse proxy, and learning how to use dynamic DNS services since my home connection has a dynamic IP. Early on, I discovered lots of traffic and lots of hard drive activity. The HD activity was from logging; it seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.
hurturue 12 December 2025
In general the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desirable.

Do we want to change that? Do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? Is net neutrality a bad thing after all?

jepj57 12 December 2025
What about a copyright notice on websites stating that anyone using your site for training grants the owner of the site an eternal, non-revocable license to the model, and must provide a copy of the model upon request? At least then there would be SOME benefit.
captn3m0 12 December 2025
I switched to rgit instead of running Gitea.
evgpbfhnr 12 December 2025
I had the same problem on our home server. I just stopped the git forge due to lack of time.

For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...

ArcHound 12 December 2025
Seems like you're cooking up a solid bot-detection solution. I'd recommend adding JA3/JA4+ into the mix; I had good results against dumb scrapers.

Also, have you considered Captchas for first contact/rate-limit?

If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
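
If your TLS terminator can log the JA3/JA4 hash (that part, and the log format below, is an assumption), even a dumb frequency count is revealing, because the noisy scrapers tend to share a handful of fingerprints:

```sh
# assuming the fingerprint hash is the last field on each access-log line:
# show the fingerprints responsible for the most requests, then blocklist the worst
awk '{print $NF}' /var/log/nginx/access_ja3.log | sort | uniq -c | sort -rn | head
```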

pabs3 12 December 2025
> the difference in power usage caused by scraping costs us ~60 euros a year
klaussilveira 12 December 2025
I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
yunnpp 23 hours ago
Thanks for putting that together. Not my daily cup of tea, but it seems like a good reference for server setup.
reactordev 22 hours ago
I host all my stuff behind a VPN. No one but authorized users can get access.
krupan 12 December 2025
In case you didn't read to the end:

"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."

frogperson 12 December 2025
Could this be solved with an EULA and some language stating that non-human readers will be billed at $1 per page? Make all users agree to it; they either pay up or they are breaching the contract.

Is this viable?

xyzal 12 December 2025
Does anyone have an idea how to generate, say, insecure code en masse? I think it should be the next frontier: not feeding them a random bytestream, but toxic waste.
frozenseven 18 hours ago
>When I pretend to be human

>i am lux (it/they/she in English, ça/æl/elle in French)

This blog is written by an insane activist who's claiming to be an animal.