I just scraped data from Reddit and other sources so I could build an NSFW classifier, and chose to open source the data and the model for the general good.
Note that I was an engineer with one year of experience, working alone on this project in my free time, so it was basically impossible for me to review or clear out the few CSAM images among the 100,000+ images in the dataset.
Now I wonder whether I should never have open sourced the data. It would have avoided a lot of these issues.
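For what it's worth, one rough way to screen a scrape like that without eyeballing every file is to hash each image and compare it against a blocklist of known-bad hashes. This is only a minimal sketch, assuming Python with the Pillow and imagehash libraries; the "blocklist.txt" file is purely hypothetical, since the real hash lists (e.g. NCMEC's) are only distributed to vetted organizations.

    # screen_dataset.py -- sketch: flag images whose perceptual hash is close
    # to an entry in a blocklist of known-bad hashes.
    # Assumes: pip install pillow imagehash; "blocklist.txt" is a hypothetical
    # file with one hex pHash per line.
    from pathlib import Path
    from PIL import Image
    import imagehash

    MAX_DISTANCE = 4  # Hamming distance threshold to count as a "match"

    def load_blocklist(path):
        # Parse one hex-encoded perceptual hash per line.
        return [imagehash.hex_to_hash(line.strip())
                for line in Path(path).read_text().splitlines()
                if line.strip()]

    def screen(dataset_dir, blocklist):
        flagged = []
        for img_path in Path(dataset_dir).rglob("*.jpg"):
            try:
                h = imagehash.phash(Image.open(img_path))
            except OSError:
                continue  # unreadable or corrupt file, skip it
            # ImageHash subtraction returns the Hamming distance.
            if any(h - bad <= MAX_DISTANCE for bad in blocklist):
                flagged.append(img_path)
        return flagged

    if __name__ == "__main__":
        bad_hashes = load_blocklist("blocklist.txt")
        for p in screen("dataset/", bad_hashes):
            print("needs manual review:", p)

Of course this only catches material that is already on a list, which is part of why a solo developer realistically can't certify a 100,000-image scrape as clean.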
Back when the first moat-creation gambit for AI failed (the claim that they were creating SkyNet, so the government needed to block anyone else from working on SkyNet, since only OpenAI could be trusted to control it and not just any rando), they moved on to the safety angle with the same idea. I recall seeing an infographic showing that all the major players, Meta, OpenAI, Microsoft, etc., had signed on to some kind of safety pledge. Basically they didn't want anyone else training on the whole world's data, because only they could be trusted not to do nefarious things with it. The infographic had a statement about not training on CSAM, revenge porn, and the like, but the corpospeak it was worded in made it sound like they were promising not to do it anymore, not that they never had.
I've tried to find this graphic again several times over the years, but it's either been scrubbed from the internet or I just can't remember enough details to find it. Amusingly, it only just occurred to me that maybe I should ask ChatGPT to help me find it.
As a small point of order, they did not get banned for "finding CSAM" like the outrage- and clickbait title claims. They got banned for uploading a data set containing child porn to Google Drive. They did not find it themselves, and them later reporting the data set to an appropriate organization is not why they got banned.
Just a few days ago I was doing some low-paid (well, not so low) AI classification tasks, akin to Mechanical Turk ones, for a very big company, and the platform showed me, involuntarily on my part (I guess they don't review the images before showing them), an AI-generated image depicting a naked man and a naked kid, though it was more Barbie-like than anything else. I didn't really enjoy the view, to be honest. I contacted them but never got an answer back.
This raises an interesting point. Do you need to train models using CSAM so that the model can self-enforce restrictions on CSAM? If so, I wonder what moral/ethical questions this brings up.
Banning people for pointing out the emperor's new clothes is what autocrats typically do, because the thing they hate most is anyone embarrassing them.
Slightly unrelated, but I wonder: if a 17-year-old sends a dirty photo of herself to an 18-year-old guy she likes, who goes to jail? Just curious how the law works when there is no "abuse" element.
Technofeudalism strikes again. MAANG can ban people at any time for anything without appeal, and sometimes at the whim of any nation state. Reversal is the rare exception, not the rule, and only happens occasionally due to public pressure.
A Developer Accidentally Found CSAM in AI Data. Google Banned Him for It (404media.co)
119 points by markatlarge | 11 December 2025 | 92 comments