When should data scientists try a new technique?


If a scientist wants to forecast ocean currents to understand how pollution spreads after an oil spill, she can use a common method that models currents spanning roughly 10 to 200 kilometers. Or she could choose a newer model that also captures smaller-scale currents. It may be more accurate, but it may also require learning new software or running new computational experiments. How can she know whether the new method is worth the time, cost and effort?

A new approach developed by MIT researchers could help data scientists answer this question, whether they are working with statistics on ocean currents, violent crime, children’s literacy, or any other type of dataset.

The team created a new measure, called the “c-value,” that helps users choose between techniques based on how likely it is that a new method is more accurate for a particular dataset. In effect, it answers the question: “Is it likely that the new method is more accurate for this data than the conventional method?”

Traditionally, statisticians compare methods by averaging a method’s accuracy over all possible datasets. But a method that is better on average across all datasets won’t necessarily give a better estimate on one particular dataset. The average is not application-specific.
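The gap between average-case performance and performance on a single dataset can be seen in a small simulation. The sketch below is an illustration, not an example from the paper: it compares the raw sample observations (the conventional estimator) with James–Stein shrinkage, which is provably better on average in three or more dimensions, yet still loses on some individual datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10            # dimension of the unknown mean vector
n_trials = 2000   # independent simulated datasets

js_wins = 0
for _ in range(n_trials):
    theta = rng.normal(0.0, 1.0, d)      # hidden truth (known here only because we simulate)
    y = theta + rng.normal(0.0, 1.0, d)  # one noisy observation per coordinate
    mle = y                              # conventional estimator: the observations themselves
    # James-Stein shrinkage toward zero: lower *average* risk for d >= 3
    js = (1.0 - (d - 2) / np.sum(y**2)) * y
    if np.sum((js - theta) ** 2) < np.sum((mle - theta) ** 2):
        js_wins += 1

frac = js_wins / n_trials
# Shrinkage wins on most simulated datasets -- but not on all of them,
# so "better on average" does not settle the question for *your* dataset.
print(f"James-Stein beat the conventional estimator on {frac:.0%} of datasets")
```

In the simulation we can check which estimator won because we generated the truth ourselves; in a real analysis the truth is unknown, which is exactly the problem the c-value addresses.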

So the researchers from MIT and elsewhere created the c-value, which is dataset-specific. A high c-value means the new method is unlikely to be less accurate than the original method on a particular data problem.

In their proof-of-concept paper, the researchers describe and evaluate the c-value on three real-world data analysis problems: modeling ocean currents, estimating violent crime rates in neighborhoods, and estimating students’ reading ability. They show how the c-value can help statisticians and data analysts achieve more accurate results by indicating when to use alternative estimation methods they might otherwise have overlooked.

“What we’re trying to do with this particular work is come up with something that is data-specific. The classical notion of risk is really natural for someone developing a new method; that person wants their method to work well for all of their users, on average. But a user of a method wants something that will work on their individual problem. We’ve shown that the c-value is a very practical proof-of-concept in that direction,” says senior author Tamara Broderick, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems and the Institute for Data, Systems, and Society.

She was joined on the paper by Brian Trippe, a former graduate student in Broderick’s group who is now a postdoctoral researcher at Columbia University; and Sameer Deshpande, a former postdoc in Broderick’s group who is now an assistant professor at the University of Wisconsin–Madison. An accepted version of the paper is posted online in the Journal of the American Statistical Association.

Evaluating estimators

The c-value is designed to help with problems in which researchers seek to estimate an unknown parameter from a dataset, such as estimating average student reading ability from a dataset of assessment results and survey responses. The researcher has two estimation methods and must decide which one to use for this particular problem.

The better estimation method is the one that incurs less “loss,” meaning its estimate lands closer to the underlying truth. Think again of predicting ocean currents: an error of a few meters per hour may not matter much, but an error of several kilometers per hour renders the estimate useless. The underlying truth, however, is unknown; it is precisely what the scientists are trying to estimate. So one can never actually compute the loss of an estimate on a particular dataset, and that is what makes comparing estimators so difficult. The c-value helps scientists get around this obstacle.

The c-value formula uses a particular dataset to compute the estimate from each method, and then uses that same dataset to compute the c-value comparing the two. If the c-value is large, it is unlikely that the alternative method is inferior, that is, that it yields a less accurate estimate than the original method.

“In our case, we assume you want to stay conservative with the default estimator, and that you only switch to the new estimator if you feel very confident about it. If you get a low c-value, you can’t say anything conclusive. You could actually have done better, but you just don’t know,” Broderick explains.
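The decision rule can be made concrete with a toy sketch. This is not the paper’s c-value construction (the actual formula is in the JASA article); it is a hypothetical parametric-bootstrap analogue of the same idea: plug the default estimate in as a stand-in truth, resample datasets around it, and switch only when the alternative is rarely worse than the default.

```python
import numpy as np

def switch_to_alternative(y, default_fn, alt_fn, alpha=0.05, n_boot=2000, rng=None):
    """Toy analogue of the c-value decision rule (NOT the paper's construction).

    Treat the default estimate as a stand-in "truth", simulate datasets
    around it, and report how often the alternative estimator's squared
    loss beats the default's. Switch only when that is almost always.
    """
    rng = rng or np.random.default_rng()
    theta_hat = default_fn(y)  # plug-in stand-in for the unknown truth
    worse = 0
    for _ in range(n_boot):
        y_star = theta_hat + rng.normal(0.0, 1.0, size=y.shape)
        loss_alt = np.sum((alt_fn(y_star) - theta_hat) ** 2)
        loss_def = np.sum((default_fn(y_star) - theta_hat) ** 2)
        worse += loss_alt > loss_def
    # A value near 1 means the alternative was almost never worse in resamples.
    c = 1.0 - worse / n_boot
    return c, bool(c >= 1.0 - alpha)

# Hypothetical example: default = raw observations, alternative = shrinkage.
d = 50
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, d) + rng.normal(0.0, 1.0, d)
default = lambda v: v
shrink = lambda v: (1.0 - (d - 2) / np.sum(v ** 2)) * v
c, switch = switch_to_alternative(y, default, shrink, rng=rng)
print(f"confidence value: {c:.3f}, switch to alternative: {switch}")
```

The conservatism Broderick describes lives in the threshold: a low score does not mean the alternative is worse, only that the evidence isn’t strong enough to justify abandoning the default.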

Testing the theory

The researchers put the theory to the test by evaluating the c-value on three real-world data analysis problems.

First, they used the c-value to help determine which method is best for modeling ocean currents, a problem Trippe has been working on. Accurate models are important for predicting how contaminants, such as pollution from an oil spill, will disperse. The team found that estimating ocean currents at multiple scales, one larger and one smaller, is likely to give greater accuracy than using large-scale measurements alone.

“Ocean researchers are working on this, and the c-value can provide some statistical ‘boost’ to aid modeling at a smaller scale,” says Broderick.

In another example, the researchers sought to estimate violent crime rates in census tracts in Philadelphia, an application Deshpande has been working on. Using the c-value, they found that incorporating census-level information on nonviolent crime could yield a better estimate of the violent crime rate. They also used the c-value to show that additionally leveraging violent crime data from neighboring census tracts was unlikely to yield further improvements in accuracy.

“That doesn’t mean there’s no improvement, it just means we don’t feel confident saying you’ll get one,” she says.

Now that they have demonstrated the c-value in theory and shown how it can be used on real-world data problems, the researchers want to extend the measure to more forms of data and a broader set of model classes.

The ultimate goal is a measure general enough for many more data analysis problems, and while there is still a lot of work to do toward that end, Broderick says it is an important and exciting first step in the right direction.

More information:
Brian L. Trippe et al, Confidently Comparing Estimators with the c-value, Journal of the American Statistical Association (2022). DOI: 10.1080/01621459.2022.2153688

This story is republished courtesy of MIT News, a popular site that covers news about MIT research, innovation and teaching.

Citation: When should data scientists try a new technique? (2023, January 26), retrieved January 26, 2023

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.

