When privacy becomes training data
Researchers found millions of passports, credit cards, résumés, and faces in DataComp CommonPool, a massive AI training dataset scraped from the web.
An audit of just 0.1% of the dataset led the researchers to estimate that the full set contains hundreds of millions of likely PII (personally identifiable information) items, including sensitive job and health details.
Despite automated face blurring, the researchers estimate that 102 million faces were missed, and captions and metadata still expose names, addresses, and locations.
The dataset has been downloaded more than 2 million times, so many AI models may already have been trained on this data, raising privacy and consent concerns.
Experts warn that anything posted online has probably been scraped, which highlights the urgent need for new laws and ethical standards in AI data use.
Read the MIT Technology Review article for more information.