When privacy becomes training data

August 17, 2025

Researchers report that DataComp CommonPool, a massive AI training dataset scraped from the web, likely contains millions of images of passports, credit cards, résumés, and faces.

After auditing just 0.1% of the dataset, they estimate that the full set holds hundreds of millions of items likely to contain PII (personally identifiable information), including sensitive job and health details.

Despite automated face blurring, researchers estimate that 102 million faces were missed, and captions and metadata still expose names, addresses, and locations.

With over 2 million downloads, countless AI models may already have been trained on this data, raising privacy and consent concerns.

Experts warn that if it’s online, it has probably been scraped, which highlights the urgent need for new laws and ethical standards in AI data use.

Read the MIT Technology Review article for more information.
