AI Privacy

When privacy becomes training data

Researchers found millions of passports, credit cards, résumés, and faces in DataComp CommonPool, a massive AI training dataset scraped from the web.

From an audit of just 0.1% of the dataset, the researchers estimate that the full set contains hundreds of millions of items of likely PII (personally identifiable information), including sensitive job and health details.

Although face-blurring tools were applied, researchers estimate that they missed 102 million faces, and captions and metadata still expose names, addresses, and locations.

The dataset has been downloaded more than 2 million times, so countless AI models may already have been trained on this data, raising privacy and consent concerns.

Experts warn that anything posted online has probably been scraped, which highlights the urgent need for new laws and ethical standards in AI data use.

Read the MIT Technology Review article for more information.