In machine learning, synthetic data can offer real performance improvements
Teaching a machine to recognize human actions has many potential applications, such as automatically detecting workers falling at a construction site or allowing smart home robots to interpret user gestures.
To do this, the researchers train machine learning models using a huge dataset of video clips showing humans performing actions. However, not only is it expensive and laborious to collect and label millions or billions of videos, but the clips often contain sensitive information, like other people’s faces or license plates. Use of these videos may also infringe copyright or data protection law. And this assumes that the video data is publicly available from the start — many of the datasets are owned by companies and aren’t free to use.
So researchers are turning to synthetic datasets. These are generated by a computer that uses 3D models of scenes, objects, and humans to quickly produce many varied clips of specific actions, without the potential copyright issues or ethical concerns that come with real data.
But is synthetic data as "good" as real data? How well does a model trained with these data perform when asked to classify real human actions? A team of researchers at MIT, the MIT-IBM Watson AI Lab, and Boston University sought to answer this question. They built a synthetic dataset of 150,000 video clips that captured a wide range of human actions, which they used to train machine learning models. Then they showed these models six datasets of real-world videos to see how well they could learn to recognize actions in those clips.
The researchers found that synthetically trained models performed even better than models trained on real data for videos with fewer background objects.
This work could help researchers use synthetic datasets in such a way that models achieve higher accuracy on real-world tasks. It could also help scientists identify which machine learning applications are best suited to training with synthetic data, in an effort to mitigate some of the ethical, privacy, and copyright concerns of using real datasets.
"The ultimate goal of our research is to replace real data pre-training with synthetic data pre-training. There is a cost in creating an action in synthetic data, but once that is done, you can generate an unlimited number of images or videos," says Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab and co-author of a paper detailing this research.
The paper is led by first author Yo-whan "John" Kim ’22. He is joined by Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.
Building a synthetic dataset
The researchers began by compiling a new dataset using three publicly available datasets of synthetic video clips that capture human actions. Their dataset, called Synthetic Action Pre-Train and Transfer (SynAPT), contains 150 action categories, with 1,000 video clips for each category.
They chose as many action categories as possible, such as people waving or falling to the floor, based on the availability of clips containing clean video data.
Once the dataset was prepared, they used it to pre-train three machine learning models to recognize actions. Pre-training involves training a model on one task to give it a head start on learning other tasks. Inspired by the way people learn, reusing old knowledge when we learn something new, a pre-trained model can use the parameters it has already learned to help it pick up a new task with a new dataset faster and more effectively.
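The pre-train-then-transfer idea can be illustrated with a purely hypothetical sketch (this is not the authors' code; a toy 1-D logistic regression stands in for a video network, and all the data below is made up):

```python
import math
import random

def train(weights, data, lr=0.1, epochs=200):
    # Plain per-sample gradient descent on the logistic loss.
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted probability
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def accuracy(weights, data):
    w, b = weights
    return sum(
        (1 / (1 + math.exp(-(w * x + b))) > 0.5) == bool(y) for x, y in data
    ) / len(data)

random.seed(0)
# "Pre-training" task (stands in for plentiful synthetic clips): label 1 when x > 0.
pretrain_data = [(x, int(x > 0)) for x in (random.uniform(-2, 2) for _ in range(200))]
# "Downstream" task (stands in for scarce real clips): related but shifted (x > 0.5).
finetune_data = [(x, int(x > 0.5)) for x in (random.uniform(-1, 2) for _ in range(40))]

pretrained = train((0.0, 0.0), pretrain_data)
# Transfer: fine-tuning starts from the pre-trained parameters, not from zeros,
# so the model adapts to the new task with little data and few epochs.
finetuned = train(pretrained, finetune_data, epochs=50)
```

The parameters learned on the plentiful first task give the data-poor second task a warm start, which is the same mechanism that lets pre-training on synthetic videos transfer to real ones.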
They then tested the pre-trained models using six datasets of real video clips, each capturing classes of actions that differed from those in the training data.
The researchers were surprised to find that all three synthetically trained models outperformed models trained with real video clips on four of the six datasets. Their accuracy was highest for datasets containing video clips with "low scene-object bias".
Low scene-object bias means that the model cannot recognize the action by looking at the background or other objects in the scene; it must focus on the action itself. For example, if a model is tasked with classifying diving positions in video clips of a person diving into a pool, it cannot identify the position by looking at the water or the tiles on the wall. It must focus on the person's motion and pose to classify the action.
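A deliberately toy sketch can make this concrete (entirely illustrative, and not from the paper: "clips" here are just lists of frame-brightness numbers, not real video features). A cue based on temporal dynamics keeps working even when a shortcut based on scene appearance misfires:

```python
def motion_cue(clip):
    # Temporal dynamics: look at frame-to-frame changes only.
    diffs = [b - a for a, b in zip(clip, clip[1:])]
    # Call it a "fall" if brightness drops monotonically, otherwise a "wave".
    return "fall" if all(d < 0 for d in diffs) else "wave"

def background_cue(clip, threshold=5.0):
    # Appearance shortcut: guess the action from overall scene brightness.
    return "fall" if sum(clip) / len(clip) > threshold else "wave"

fall_clip   = [10, 8, 6, 4, 2]   # falling action, bright scene
wave_clip   = [1, 3, 1, 3, 1]    # waving action, dim scene
wave_bright = [9, 11, 9, 11, 9]  # the SAME waving motion, but a bright scene

# When background correlates with the action (high scene-object bias),
# both cues agree; on wave_bright the appearance shortcut misfires while
# the motion cue still classifies correctly.
```

This is the failure mode Kim describes below: a model leaning on objects or background gets confused the moment those cues stop correlating with the action.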
"In videos with low scene-object bias, the temporal dynamics of the actions matter more than the appearance of the objects or the background, and that seems to be captured well by synthetic data," Feris says.
"High scene-object bias can actually act as an obstacle. The model may misclassify an action by looking at an object rather than at the action itself. It can confuse the model," Kim explains.
Based on these results, the researchers want to include more action classes and additional synthetic video platforms in future work, ultimately building a catalog of models pre-trained using synthetic data, says co-author Rameswar Panda, a researcher at the MIT-IBM Watson AI Lab.
"We want to build models that perform similarly to or even better than existing models in the literature, but without being bound by any of those biases or security concerns," he adds.
They also want to combine their work with research that aims to create more accurate and realistic synthetic videos, which could further boost the performance of the models, says co-author SouYoung Jin, a postdoc in CSAIL. She is also interested in exploring how models may learn differently when they are trained with synthetic data.
"We use synthetic datasets to prevent privacy issues or contextual or social biases, but what does the model actually learn? Does it learn something that is unbiased?" she says.
Now that they have demonstrated this potential use for synthetic videos, they hope other researchers will build on their work.
"Despite the lower cost of obtaining well-annotated synthetic data, we currently do not have a dataset with a scale that rivals the largest annotated datasets of real videos. By discussing the different costs and concerns of real videos, and showing the efficacy of synthetic data, we hope to motivate efforts in this direction," adds co-author Samarth Mishra, a PhD student at Boston University (BU).
Full paper (PDF): How Transferable are Video Representations Based on Synthetic Data?
Powered by the MIT Computer Science & Artificial Intelligence Laboratory
Citation: In machine learning, synthetic data can offer real performance improvements (2022, November 3) retrieved 4 November 2022 from https://techxplore.com/news/2022-11-machine-synthetic-real.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.