AI tools are secretly training on real images of children

News7f June 10, 2024

More than 170 images and personal details of children from Brazil were collected into an open-source dataset without their knowledge or consent and used to train AI, according to a new report from Human Rights Watch released Monday.

According to the report, the images were pulled from content posted as recently as 2023 and as far back as the mid-1990s, long before any Internet user could have predicted that their content might be used to train AI. Human Rights Watch claims that these children’s personal information, along with links to their photos, was included in LAION-5B, a dataset that is a popular source of training data for AI startups.

“Their privacy is violated from the start, when their photos are collected and pulled into these datasets. These AI tools are then trained on this data and can thus create realistic images of children,” said Hye Jung Han, a children’s rights and technology researcher at Human Rights Watch who found these images. “The technology is developed in such a way that any child who has any photo or video of themselves online can be at risk, because any malicious actor can take that photo and then use these tools to manipulate it however they want.”

LAION-5B is based on Common Crawl, a data repository created by crawling the web and made available to researchers, and has been used to train several AI models, including Stability AI’s Stable Diffusion image generation tool. According to the website of LAION, the German nonprofit that created it, the dataset is openly accessible and currently includes more than 5.85 billion image and caption pairs.

The images of children the researchers found came from mommy blogs and other personal, maternity or parenting blogs, as well as stills from YouTube videos with small view counts, seemingly uploaded to share with family and friends.

“Just looking at the context in which they were posted, they had an expectation and some degree of privacy,” Hye said. “Most of these images cannot be found online through a reverse image search.”

LAION spokesman Nate Tyler said the organization has already taken action. “LAION-5B was taken down after a Stanford report found links in the dataset pointing to illegal content on the public web,” he said, adding that the organization is working with several groups, among them Stanford and Human Rights Watch, “to remove all known references to illegal content.”

YouTube’s terms of service do not allow scraping except in certain cases; these instances appear to violate those policies. “We have made clear that unauthorized scraping of YouTube content is a violation of our Terms of Service, and we continue to take action against this type of abuse,” said YouTube spokesman Jack Maon.

In December, researchers at Stanford University found that AI training data collected in LAION-5B contained child sexual abuse material. The problem of deepfakes is also growing among students in US schools, where they are being used to bully classmates, especially girls. Hye worries that, beyond the use of children’s photos to create child sexual abuse material, the database could reveal potentially sensitive information, such as locations or medical data. In 2022, an artist from the United States found her own image in the LAION dataset and realized it was from her personal medical records.