Gitea has a built-in defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.
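For anyone wondering where that lives, it is a one-line setting in app.ini; a minimal sketch, assuming a reasonably recent Gitea release where the `expensive` value exists:

```ini
; app.ini (sketch): require sign-in only for the "expensive" views
; while keeping normal repository browsing open to anonymous visitors.
; Check your Gitea version's docs for exactly which routes this covers.
[service]
REQUIRE_SIGNIN_VIEW = expensive
```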
If you don't need global access, I have found that geoblocking is the best first step, especially if you are in a small country with a small footprint and can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.
In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
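For what it's worth, the geoblocking approach above can be a handful of nftables rules plus a published per-country CIDR list; a rough sketch, assuming a list provider such as ipdeny.com (country codes and ports are placeholders):

```sh
# Sketch: drop HTTP(S) traffic from a set of country CIDR ranges.
nft add table inet geoblock
nft add set inet geoblock blocked4 '{ type ipv4_addr; flags interval; }'
nft add chain inet geoblock input '{ type filter hook input priority -10; policy accept; }'
nft add rule inet geoblock input ip saddr @blocked4 tcp dport '{ 80, 443 }' drop

# Load one country's published ranges into the set (repeat per country).
cidrs=$(curl -s https://www.ipdeny.com/ipblocks/data/countries/ru.zone | paste -sd, -)
nft add element inet geoblock blocked4 "{ $cidrs }"
```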
> VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
This will in part be people on home connections tinkering with LLMs, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing or updating a browser add-in or “free” VPN client that has become, or always was, nefarious) and whose home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
I'm not 100% against AI, but I do cheer loudly when I see things like this!
I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
I wonder if this is going to push more and more services to be hidden from the public internet.
My personal services are only accessible from my own LAN or via a VPN.
If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all.
I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a roughly daily basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
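Something like this on their side would be all it takes (a sketch; the URL is a placeholder):

```sh
# One-time: a bare mirror of everything, all branches and tags included.
git clone --mirror https://git.example.org/someone/project.git project.git

# Daily cron: only new objects cross the wire.
git -C project.git fetch --prune

# Blame, log, diffs etc. can then be computed locally, off the forge:
git -C project.git log --all --oneline
```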
On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.
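For anyone wanting to replicate that split, it looks roughly like this in nginx with oauth2-proxy running in auth_request mode. This is only a sketch: upstream names, ports and hostname are placeholders, and TLS plus the usual proxy headers are omitted for brevity.

```nginx
upstream forge        { server 127.0.0.1:3000; }   # Gitea/Forgejo/etc.
upstream oauth2_proxy { server 127.0.0.1:4180; }

server {
    listen 443 ssl;
    server_name git.example.org;

    # git clone/fetch/push over smart HTTP stays open, so CI keeps working.
    location ~ ^/.+/(info/refs|git-upload-pack|git-receive-pack)$ {
        proxy_pass http://forge;
    }

    # oauth2-proxy's own endpoints (sign-in, callback, etc.).
    location /oauth2/ {
        proxy_pass http://oauth2_proxy;
        proxy_set_header X-Forwarded-Uri $request_uri;
    }

    location = /oauth2/auth {
        internal;
        proxy_pass http://oauth2_proxy;
        proxy_set_header Content-Length "";
        proxy_pass_request_body off;
    }

    # Everything else, i.e. the expensive web UI, requires a login.
    location / {
        auth_request /oauth2/auth;
        error_page 401 = /oauth2/sign_in;
        proxy_pass http://forge;
    }
}
```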
Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0, even on my silliest of throw-away sites. I agree their caching method is good and should be used when much of the content is cacheable. Blocking specific IPs is a never-ending game of whack-a-mole. I do block some data-center ASNs, as I do not expect real people to come from them, even though they could; it's an acceptable trade-off for my junk. There are many things people can learn from capturing TCP SYN packets for a day and comparing them to access logs while sorting out bots vs. legit people. There are quite a few headers that a browser will send that most bots do not. Many bots also fail to send a valid TCP MSS and TCP window.
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require clients to say they accept English, optionally in addition to other languages, but not a couple of combinations that are blocked; e.g. en,de-DE,de is good, while de-DE,de will fail, just because. Not suggesting anyone do this.

[1] - https://mirror.newsdump.org/bot_test.txt
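For the curious, both of those checks (requiring HTTP/2.0, and the Accept-Language rule, which again nobody is recommending) can be sketched in a few lines of nginx. Hostname and paths below are placeholders; nginx reports h2 requests as "HTTP/2.0" in $server_protocol, and the `http2 on;` form needs a recent nginx (older versions use `listen 443 ssl http2;`):

```nginx
# Require that the client claims English somewhere in Accept-Language;
# explicitly blocked combinations go first, since the first matching
# regex wins.
map $http_accept_language $lang_ok {
    default        0;
    "~*^de-DE,de$" 0;   # an explicitly blocked combination
    "~*\ben\b"     1;   # any list mentioning English passes
}

server {
    listen 443 ssl;
    http2 on;
    server_name mirror.example.org;

    # Turn away anything that negotiated an older HTTP version.
    if ($server_protocol != "HTTP/2.0") {
        return 403;
    }

    if ($lang_ok = 0) {
        return 403;
    }

    location / {
        root /var/www/mirror;
    }
}
```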
I, too, am self-hosting some projects on an old computer. And the fact that you can "hear the internet" (the fans spinning up) is really cool (unless you're trying to sleep while being scraped).
"Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."
I do wonder, though. Content scrapers that truly value data would stand to benefit from deploying heuristics that prize being as efficient as possible in terms of information per query. Wastefulness of the described type loads not just your servers but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near zero, as they have zero interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and charge interested parties for access to your data.
This is a good illustration of why letting websites have direct access to git is not a great idea. I started creating static versions of my projects, with great success: https://git.erock.io
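For anyone who wants to do the same without that particular stack, a static-page generator such as stagit can produce the pages; a rough sketch, with placeholder paths (check stagit's docs for the exact outputs):

```sh
# Sketch: render the repo to plain HTML whenever it changes, then serve
# the output as static files so no dynamic git views face the crawlers.
mkdir -p /var/www/git/project
cd /var/www/git/project
stagit /srv/git/project.git                        # writes log.html, files.html, ...
stagit-index /srv/git/*.git > /var/www/git/index.html
```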
For private instances, you can get down to zero scrapers by firewalling the HTTP/S ports from the internet and using WireGuard. I knew it was time to batten down the hatches when fail2ban became the top process by bytes written in iotop (between SSH login attempts and nginx logs).
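Concretely, that can be as little as this (an nftables sketch; interface names and ports are whatever your setup uses, and test it carefully so you don't lock yourself out of SSH):

```sh
# Sketch: default-deny inbound; only WireGuard is reachable from the WAN,
# and SSH/HTTP(S) are reachable only over the tunnel interface.
nft add table inet vpnonly
nft add chain inet vpnonly input '{ type filter hook input priority 0; policy drop; }'
nft add rule inet vpnonly input ct state established,related accept
nft add rule inet vpnonly input iifname "lo" accept
nft add rule inet vpnonly input udp dport 51820 accept
nft add rule inet vpnonly input iifname "wg0" tcp dport '{ 22, 80, 443 }' accept
```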
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Why the hell don't these bots just do a git clone and analyse the source code locally? Much less impact on the server, and they would be able to perform the same analysis on all repositories, regardless of what a particular git forge offers.
It would be nice if there was a common crawler offering deltas on top of base checkpoints of the entire crawl; I am guessing most AI companies would prefer not having to mess with their own scrapers. Google could probably make a mint selling access.
I was setting up a small system to do web site serving, mostly just experimental, to try out some code: learning how to use nginx as a reverse proxy, and learning how to use dynamic DNS services since my home connection is on a dynamic IP. Early on, I discovered lots of traffic and lots of hard drive activity. The HD activity was from logging. It seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.
In general, the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desired.
Do we want to change that? Do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? Is net neutrality a bad thing after all?
What about a copyright notice on websites stating that anyone using your site for training grants the owner of the site an eternal, non-revocable license to the model, and must provide a copy of the model upon request? At least then there would be SOME benefit.
I had the same problem on our home server. I just stopped the git forge due to lack of time.
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Seems like you're cooking up a solid bot detection solution. I'd recommend adding JA3/JA4+ into the mix; I had good results with it against dumb scrapers.
Also, have you considered CAPTCHAs for first contact / rate limiting?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
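Not quite a public database, but Team Cymru's whois service will at least map an IP to its origin ASN without going through a commercial middleman; a quick sketch (the addresses are placeholders):

```sh
# Look up the origin ASN and AS name for a single address.
whois -h whois.cymru.com " -v 203.0.113.7"

# Bulk lookups: frame a list of IPs with "begin" / "end" and send it to port 43.
printf 'begin\nverbose\n203.0.113.7\n198.51.100.23\nend\n' | nc whois.cymru.com 43
```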
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Could this be solved with an EULA and some language that non-human readers will be billed at $1 per page? Make all users agree to it. They either pay up or they are breaching contract.
Does anyone have an idea how to generate, say, insecure code en masse? I think it should be the next frontier. Not feeding them a random bytestream, but toxic waste.
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
Is this viable?
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.