Look, we just need to add some new 'planes' to Unicode that mirror all communicatively useful characters, but with extra state bits for:
- guaranteed human output: anyone who emits text in these ranges that was AI-generated, rather than artisanally human-composed, goes straight to jail.
- for human eyes only: anyone who lets any AI train on, or even consider, any text in these ranges goes straight to jail. Fnord: "that doesn't look like anything to me".
- admittedly AI-generated: all AI output must use these ranges as disclosure, or, you guessed it, those pretending otherwise go straight to jail.
Of course, all the ranges generate visually indistinguishable homoglyphs, so it's a strictly software-mediated, quasi-covert channel for fair disclosure.
When you cut & paste text from various sources, the provenance comes with it via the subtle character encoding differences.
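For the avoidance of doubt about the mechanics: a minimal sketch in Python, borrowing a Private Use Area offset as a stand-in for the (entirely hypothetical) disclosure planes. A real scheme would need actual plane allocations and fonts that map them to homoglyphs.

```python
# Sketch of a homoglyph provenance channel: shift printable ASCII into a
# hypothetical "admittedly AI-generated" range at a fixed offset.
# 0xF0000 is Unicode's Supplementary Private Use Area-A, used here purely
# for illustration; no such disclosure range actually exists.

AI_PLANE_OFFSET = 0xF0000  # hypothetical, not a real Unicode assignment

def tag_as_ai(text: str) -> str:
    """Re-encode printable ASCII into the hypothetical AI-disclosure range."""
    return "".join(
        chr(ord(c) + AI_PLANE_OFFSET) if " " <= c <= "~" else c
        for c in text
    )

def provenance(text: str) -> str:
    """Recover the tag that travels with the text when it is cut and pasted."""
    if any(0xF0020 <= ord(c) <= 0xF007E for c in text):
        return "ai"
    return "human"

tagged = tag_as_ai("hello")
print(provenance(tagged))   # ai
print(provenance("hello"))  # human
print(len(tagged))          # still 5 code points, visually "hello" given fonts
```

Note the channel survives copy and paste but not retyping, OCR, or any normalizing transform, which is roughly where the jail sentences come in.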
AI-generated content is inherently a regression to the mean and harms both training and human utility. There is no benefit in publishing anything an AI can generate; just ask the question yourself. Maybe publish all AI content with <AI generated content> tags, but other than that it is a public nuisance far more often than a public good.
I'm not convinced this is going to be as big of a deal as people think.
Long-run you want AI to learn from actual experience (think repairing cars instead of reading car repair manuals), which both (1) gives you an unlimited supply of non-copyrighted training data and (2) handily sidesteps the issue of AI-contaminated training data.
I like how the chosen terminology is perfectly picked to paint the concern as irrelevant.
"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature that it can generally be used."
I don't see:
1. That there will be a need for "uncontaminated" data. LLM output is probably slightly better than the natural-background Reddit comment, falsehoods and all.
2. That "uncontaminated" data will be difficult to find, what with archive.org, Project Gutenberg, etc.
3. That LLM output is going to infest everything anyway.
Currently, there is no reason to believe that "AI contamination" is a practical issue for AI training runs.
AIs trained on public scraped data that predates 2022 don't noticeably outperform those trained on scraped data from 2022 onwards. Hell, in some cases, newer scrapes perform slightly better, token for token, for unknown reasons.
Used paper books, especially the poor-but-functional copies known as “reading copies” or “ex-library” copies, are going for a song on the used book market. I recommend starting your own physical book library, including basic reference texts, and supporting your local public and university libraries. Keep paper copies of articles in your areas of expertise and interest. Follow the ways of your ancestors.
I’ve had AIs outright lie about facts, and I’m glad to have had a physical library available to convince myself that I was correct, even if I couldn’t convince the AI of that in all cases.
Does this analogy work? It's exceedingly hard to make new low-background steel, since those radioactive particles are everywhere. But it's not difficult to make AI-free content: just don't use AI to write it.
This site is literally named for the Y combinator! Modulo some philosophical hand-waving, if there's one thing we ought to demand of our inference models, it's the ability to find the fixed point of a function that takes content and outputs content, then consumes that same content!
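The gag does actually run. Under Python's eager evaluation you need the eta-expanded variant (the Z combinator rather than Y proper), but it really does find the fixed point of a step function. A tongue-in-cheek sketch, with factorial standing in for "content in, content out":

```python
# Z combinator: the applicative-order fixed-point combinator.
# (Y itself diverges under eager evaluation, hence the eta-expansion.)
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# A step function whose fixed point is factorial: given the recursive
# call as an argument, it produces one more layer of the recursion.
fac_step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

factorial = Z(fac_step)
print(factorial(5))  # 120
```

Whether an inference model consuming its own output converges to anything this well-behaved is, of course, the whole question.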
I too am optimistic that recursive training on data that is a mixture of both original human content and content derived from original content, and content derived from content derived from original human content, …ad nauseam, will be able to extract the salient features and patterns of the underlying system.
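The sarcasm is well-founded. A toy version of that recursion, fitting a Gaussian to samples drawn from the previous generation's fit, shows the usual "model collapse" failure mode: each refit loses a little of the tails, so diversity shrinks generation over generation. This is a stand-in for real training dynamics, not a model of them:

```python
import random
import statistics

random.seed(0)
N = 20             # tiny "training set" per generation
GENERATIONS = 300

# Generation 0: "original human content" from a standard normal.
data = [random.gauss(0.0, 1.0) for _ in range(N)]

stdevs = []
for _ in range(GENERATIONS):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    stdevs.append(sigma)
    # The next generation trains purely on the previous generation's output.
    data = [random.gauss(mu, sigma) for _ in range(N)]

print(f"gen 0 stdev: {stdevs[0]:.3f}, gen {GENERATIONS - 1} stdev: {stdevs[-1]:.3g}")
```

The spread collapses toward zero because sampling noise only ever removes variance in expectation; nothing in the loop puts the tails back.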
I have started to write „organic“ content again, as I am fed up with the ultra-polished, super-noisy texts from colleagues.
I realise that when I write (not-so-perfect) „organic“ content, my colleagues enjoy it more. And as I am lazy, I get right to the point: no prelude, no „Summary“, just a few paragraphs of genuine ideas.
And I am sure this will become a trend again. Until, maybe, LLMs are trained to generate this kind of non-perfect, less noisy text.
Wouldn't actually curated content still be better? That is, content where, say, a lot of blogspam and other content potentially generated by certain groups had been removed? I distinctly remember that a lot of content even before AI was of very poor quality.
On the other hand, a lot of poor-quality content could still be factually valid enough, just not well edited or formatted.
Love the concept (and the historical story is neat too).
Came up a month or so ago in a discussion about Wikipedia: Database Download (https://news.ycombinator.com/item?id=43811732). I missed that it was jgrahamc behind the site. Great stuff.
Any user profile created pre-2022 is low-background steel. I now find myself checking the creation date when it seems like a user is outputting low-quality content. Much to my dismay, I'm often wrong.
I do have to say that outside of Twitter I don't personally see it all that much. But the normies do seem to encounter it and are either 1) fine with it? or 2) oblivious? And perhaps SOME non-human-origin noise is harmless.
(Plenty of humans are pure noise too, don't forget.)
Someone else pointed out the problem when I suggested, a few days ago, that it would be useful to have a LLM trained on public domain materials for which copyright has expired. The Great Books series, the out of copyright material in the Harvard libraries, that sort of thing.
That takes us back to the days when men were men, women were women, gays were criminals, trannies were crazy, and the sun never set on the British Empire.[1]
Anyone who thinks their reading skills are a reliable detector of AI-generated content is either lying to themselves about the validity of their detector or missing the opportunity to print money by selling it.
I strongly suspect more people are in the first category than the second.
When I see a JGC link on Hacker News I can't help but remember using PopFile on an old PowerMac - back when Bayesian spam filters were becoming popular. It seems so long ago but it feels like yesterday.
Tangentially, does anyone know a good way to limit web searches to the "low-background" era that integrates with the address bar, OS right-click menus, etc.? I often add a pre-2022 filter to searches manually in reaction to LLM junk results, but I'd prefer to have it on every search by default.
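One low-effort approach, assuming your browser supports custom search-engine URL templates (Chrome and Firefox both do): register a template that appends Google's before: date operator and make it the default, so the address bar and the right-click "search selection" menu both inherit the cutoff. Something like:

```
https://www.google.com/search?q=%s+before%3A2022-01-01
```

The %s is where the browser substitutes your query; Google's page dating is best-effort, so expect some post-2022 leakage.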
Low-background Steel: content without AI contamination
(blog.jgc.org) | 403 points by jgrahamc | 10 June 2025 | 266 comments
Comments
I am only (1 - epsilon) joking.
https://gagliardoni.net/#ml_collapse_steel
https://infosec.exchange/@tomgag/111815723861443432
It is also uncontaminated by AI.
[1] https://www.smbc-comics.com/comic/copyright
If you can distinguish AI content, then you can just do that.
If you can't, what's the problem?