Published in AI

AI is running out of data to scrape

22 July 2024


Important sources are now off-limits.

For years, people building powerful artificial intelligence systems have used enormous troves of text, images, and videos pulled from the Internet to train their models. Now, that data is drying up.

Over the past year, many of the most crucial web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains in three commonly used A.I. training data sets, discovered an "emerging crisis in consent" as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets – called C4, RefinedWeb and Dolma – five per cent of all data, and 25 per cent of data from the highest-quality sources, has been restricted.

Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method that lets website owners use a file called robots.txt to tell automated bots which pages they should not crawl. The study also found that as much as 45 per cent of the data in one set, C4, had been restricted by websites' terms of service.
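To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might check a site's rules before fetching a page. It uses the standard library's urllib.robotparser module. The robots.txt content, the bot name "ExampleAIBot", the example.com URLs and the helper can_crawl are all hypothetical and chosen for illustration; they are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one named bot is blocked from the whole
# site, while all other bots are only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = ("https://example.com/articles/1", "https://example.com/private/x")
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for url in urls:
            verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, url) else "blocked"
            print(f"{agent} -> {url}: {verdict}")
```

The protocol is advisory: the file only states the site owner's wishes, and it is up to each crawler to run a check like this one and respect the answer.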

The study's lead author, Shayne Longpre, said, "We're seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies but also for researchers, academics, and noncommercial entities."

Last modified on 22 July 2024