Look, we just need to add some new 'planes' to Unicode that mirror all communicatively useful characters, but with extra state bits for:
- guaranteed human output: anyone who emits text in these ranges that was AI-generated, rather than artisanally human-composed, goes straight to jail.
- for human eyes only: anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord: "that doesn't look like anything to me".
- admittedly AI-generated: all AI output must use these ranges as disclosure, or, you guessed it, those pretending otherwise go straight to jail.
Of course, all the ranges generate visually indistinguishable homoglyphs, so it's a strictly software-mediated, quasi-covert channel for fair disclosure.
When you cut & paste text from various sources, the provenance comes with it via the subtle character encoding differences.
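For the avoidance of doubt about the mechanics: a minimal sketch in Python, borrowing a Private Use Area offset as a stand-in for the (entirely hypothetical) disclosure planes. A real scheme would need actual plane allocations and fonts that map them to homoglyphs.

```python
# Sketch of a homoglyph provenance channel: shift printable ASCII into a
# hypothetical "admittedly AI-generated" range at a fixed offset.
# 0xF0000 is Unicode's Supplementary Private Use Area-A, used here purely
# for illustration; no such disclosure range actually exists.

AI_PLANE_OFFSET = 0xF0000  # hypothetical, not a real Unicode assignment

def tag_as_ai(text: str) -> str:
    """Re-encode printable ASCII into the hypothetical AI-disclosure range."""
    return "".join(
        chr(ord(c) + AI_PLANE_OFFSET) if " " <= c <= "~" else c
        for c in text
    )

def provenance(text: str) -> str:
    """Recover the tag that travels with the text when it is cut and pasted."""
    if any(0xF0020 <= ord(c) <= 0xF007E for c in text):
        return "ai"
    return "human"

tagged = tag_as_ai("hello")
print(provenance(tagged))   # ai
print(provenance("hello"))  # human
print(len(tagged))          # still 5 code points, visually "hello" given fonts
```

Note the channel survives copy and paste but not retyping, OCR, or any normalizing transform, which is roughly where the jail sentences come in.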
AI-generated content is inherently a regression to the mean and harms both training and human utility. There is no benefit in publishing anything an AI can generate; just ask the question yourself. Maybe publish all AI content with <AI generated content> tags, but other than that it is a public nuisance far more often than a public good.
I'm not convinced this is going to be as big of a deal as people think.
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of non-copyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
I like how the chosen terminology is perfectly picked to paint the concern as irrelevant.
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see:
1. That there will be a need for "uncontaminated" data. LLM output is probably slightly better than the natural-background Reddit comment, falsehoods and all.
2. That "uncontaminated" data will be difficult to find, what with archive.org, Project Gutenberg, etc.
3. That LLM output is going to infest everything anyway.
Currently, there is no reason to believe that "AI contamination" is a practical issue for AI training runs.
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
Used paper books, especially the poor-but-functional copies known as “reading copies” or “ex-library” copies, are going for a song on the used book market. I recommend starting your own physical book library, including basic reference texts, and supporting your local public and university libraries. Keep paper copies of articles in your areas of expertise and interest. Follow the ways of your ancestors.
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
Does this analogy work? It's exceedingly hard to make new low-background steel, since those radioactive particles are everywhere. But it's not difficult to make AI-free content: just don't use AI to write it.
This site is literally named for the Y combinator! Modulo some philosophical hand-waving, if there's one thing we ought to demand of our inference models, it's the ability to find the fixed point of a function that takes content and outputs content, then consumes that same content!
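The gag does actually run. Under Python's eager evaluation you need the eta-expanded variant (the Z combinator rather than Y proper), but it really does find the fixed point of a step function. A tongue-in-cheek sketch, with factorial standing in for "content in, content out":

```python
# Z combinator: the applicative-order fixed-point combinator.
# (Y itself diverges under eager evaluation, hence the eta-expansion.)
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# A step function whose fixed point is factorial: given the recursive
# call as an argument, it produces one more layer of the recursion.
fac_step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

factorial = Z(fac_step)
print(factorial(5))  # 120
```

Whether an inference model consuming its own output converges to anything this well-behaved is, of course, the whole question.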
I too am optimistic that recursive training on data that is a mixture of both original human content and content derived from original content, and content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
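The sarcasm is well-founded. A toy version of that recursion, fitting a Gaussian to samples drawn from the previous generation's fit, shows the usual "model collapse" failure mode: each refit loses a little of the tails, so diversity shrinks generation over generation. This is a stand-in for real training dynamics, not a model of them:

```python
import random
import statistics

random.seed(0)
N = 20             # tiny "training set" per generation
GENERATIONS = 300

# Generation 0: "original human content" from a standard normal.
data = [random.gauss(0.0, 1.0) for _ in range(N)]

stdevs = []
for _ in range(GENERATIONS):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    stdevs.append(sigma)
    # The next generation trains purely on the previous generation's output.
    data = [random.gauss(mu, sigma) for _ in range(N)]

print(f"gen 0 stdev: {stdevs[0]:.3f}, gen {GENERATIONS - 1} stdev: {stdevs[-1]:.3g}")
```

The spread collapses toward zero because sampling noise only ever removes variance in expectation; nothing in the loop puts the tails back.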
I have started to write „organic“ content again, as I am fed up with the ultra-polished, super-noisy texts from colleagues.
I realise that when I write (not-so-perfect) „organic“ content, my colleagues enjoy it more. And as I am lazy, I get right to the point: no prelude, no „Summary“, just a few paragraphs of genuine ideas.
And I am sure this will become a trend again. Until, maybe, LLMs are trained to generate this kind of non-perfect, less noisy text.
Wouldn't actually curated content still be better? That is, content where, say, a lot of blogspam and other content potentially generated by certain groups had been removed? I distinctly remember that a lot of content even before AI was of very poor quality.
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
Love the concept (and the historical story is neat too).
Came up a month or so ago in a discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.
Any user profile created pre-2022 is low-background steel. I now find myself checking the creation date when it seems like a user is outputting low-quality content. Much to my dismay, I'm often wrong.
I do have to say that outside of Twitter I don't personally see it all that much. But the normies do seem to encounter it and are either 1) fine with it? or 2) oblivious? And perhaps SOME non-human-origin noise is harmless.
(Plenty of humans are pure noise too, don't forget.)
Someone else pointed out the problem when I suggested, a few days ago, that it would be useful to have a LLM trained on public domain materials for which copyright has expired. The Great Books series, the out of copyright material in the Harvard libraries, that sort of thing.
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
Anyone who thinks their reading skills are a reliable detector of AI-generated content is either lying to themselves about the validity of their detector or missing the opportunity to print money by selling it.
I strongly suspect more people are in the first category than the second.
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
Tangentially, does anyone know a good way to limit web searches to the "low-background" era that integrates with the address bar, OS right-click menus, etc.? I often add a pre-2022 filter to searches manually in reaction to LLM junk results, but I'd prefer to have it on every search by default.
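One low-effort approach, assuming your browser supports custom search-engine URL templates (Chrome and Firefox both do): register a template that appends Google's before: date operator and make it the default, so the address bar and the right-click "search selection" menu both inherit the cutoff. Something like:

```
https://www.google.com/search?q=%s+before%3A2022-01-01
```

The %s is where the browser substitutes your query; Google's page dating is best-effort, so expect some post-2022 leakage.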
Low-background Steel: content without AI contamination
(blog.jgc.org) | 403 points by jgrahamc | 10 June 2025 | 266 comments
Comments
I am only (1 - epsilon) joking.
https://gagliardoni.net/#ml_collapse_steel
https://infosec.exchange/@tomgag/111815723861443432
It is also uncontaminated by AI.
[1] https://www.smbc-comics.com/comic/copyright
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?