Coursera - Machine Learning in Production - Week 2 - Section 3 - Data iteration
January 4, 2025
Week 2: Modeling Challenges and Strategies
Section 3: Data iteration
1. Data-centric AI development
Model-centric view
Take the data you have, and develop a model that does as well as possible on it.
Hold the data fixed and iteratively improve the code/model.
Data-centric view
The quality of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.
Hold the code fixed and iteratively improve the data.
How can you make your data set even better? One of the most important ways to improve the quality of a data set is data augmentation.
2. A useful picture of data augmentation
Picture the model's performance across the input space as a rubber sheet. It turns out that for unstructured data problems, pulling up one piece of this rubber sheet (improving performance on one category of input, say via data augmentation) is unlikely to cause a different piece of the sheet to dip down far below where it was. Instead, pulling up one point pulls nearby points up quite a lot, and far-away points may be pulled up a little bit, or, if you're lucky, maybe more than a little bit.
3. Data augmentation
Goal:
Create realistic examples that (i) the algorithm does poorly on, but (ii) humans (or another baseline) do well on. For speech, a common tactic is mixing background noise into clean clips (see the sketch after the checklist).
Checklist:
- Does it sound realistic?
- Is the x -> y mapping clear? (e.g., can humans recognize speech?)
- Is the algorithm currently doing poorly on it?
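A minimal sketch of the speech case, assuming clips are NumPy float arrays at the same sample rate (the function name and default SNR are illustrative choices, not from the course):

```python
import numpy as np

def add_background_noise(speech: np.ndarray, noise: np.ndarray,
                         snr_db: float = 10.0) -> np.ndarray:
    """Mix a background-noise clip into a speech clip at a target
    signal-to-noise ratio (in dB)."""
    noise = noise[: len(speech)]                 # trim noise to speech length
    speech_power = float(np.mean(speech ** 2))
    noise_power = float(np.mean(noise ** 2)) + 1e-12
    # Scale the noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` downward makes examples harder; the checklist above is the guardrail that keeps them realistic and recognizable by humans.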
The rubber sheet analogy
Image example
You can take the image and flip it horizontally. This results in a pretty realistic image: the phone buttons are now on the other side, but it could still be a useful example to add to your training set. You could also apply contrast changes, brightening the image so the scratch becomes more visible, or try darkening it.
I've also used more advanced techniques, such as GANs (Generative Adversarial Networks), to synthesize scratches like these automatically, although I've found that such techniques can be overkill: there are simpler techniques that are much faster to implement and work just fine, without the complexity of building a GAN to synthesize scratches.
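A minimal sketch of those simpler techniques using Pillow (my choice of library, and the enhancement factors are illustrative; the course doesn't prescribe either):

```python
from PIL import Image, ImageEnhance

def simple_augmentations(img: Image.Image) -> list[Image.Image]:
    """Cheap, realistic variants of a phone-inspection image."""
    return [
        # Horizontal flip: the buttons end up on the other side,
        # but the image is still realistic.
        img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),
        # Raise the contrast so a faint scratch stands out more.
        ImageEnhance.Contrast(img).enhance(1.5),
        # Brighten or darken to mimic different lighting conditions.
        ImageEnhance.Brightness(img).enhance(1.3),
        ImageEnhance.Brightness(img).enhance(0.7),
    ]
```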
Data iteration loop: add/improve data -> train the model -> error analysis -> repeat.
4. Can adding data hurt performance?
For unstructured data problems, adding data rarely hurts accuracy if:
- The model is large (low bias).
- The mapping x -> y is clear (e.g., given only the input x, humans can make accurate predictions).
5. Adding features
For structured data problems, you usually have a fixed set of users, restaurants, or products, which makes it hard to use data augmentation or to collect new data from users you don't have yet or restaurants that may or may not exist. Instead, adding features can be a more fruitful way to improve the algorithm's performance and fix problems like this one, identified through error analysis (see the sketch after the examples below).
Other food delivery examples
- Users who only order tea/coffee
- Users who only order pizza
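A minimal sketch of what added features might look like, loosely following the lecture's vegetarian example (all field names here are my own illustration, not the course's schema):

```python
from dataclasses import dataclass

@dataclass
class UserFeatures:
    avg_order_price: float
    orders_per_week: float
    only_orders_drinks: bool  # flags the "only tea/coffee" segment
    is_vegetarian: bool       # self-reported or inferred from order history

@dataclass
class RestaurantFeatures:
    avg_menu_price: float
    has_veg_options: bool

def feature_vector(u: UserFeatures, r: RestaurantFeatures) -> list[float]:
    """Concatenate user, restaurant, and cross features for a ranking model."""
    return [
        u.avg_order_price,
        u.orders_per_week,
        float(u.only_orders_drinks),
        r.avg_menu_price,
        float(r.has_veg_options),
        # Cross feature suggested by error analysis: vegetarian users matched
        # to restaurants without vegetarian options are bad recommendations.
        float(u.is_vegetarian and not r.has_veg_options),
    ]
```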
Product recommendation:
Collaborative filtering -> Content-based filtering
Over the last several years, there has been a trend in product recommendations: a shift from collaborative filtering approaches to what are called content-based filtering approaches.
Collaborative filtering is, loosely, an approach that looks at the user, tries to figure out who is similar to that user, and then recommends things that people like you also liked.
In contrast, a content-based filtering approach tends to look at you as a person and at the description of the restaurant, its menu, and other information about it, to decide whether that restaurant is a good match for you.
The advantage of content-based filtering is that even for a new restaurant or product that hardly anyone else has liked yet, you can more quickly make good recommendations by looking at the item's description rather than only at who else liked it. Recommending brand-new items like this is sometimes called the cold-start problem.
How do you recommend a brand-new product that almost no one has purchased, liked, or disliked so far? One way is to make sure you capture good features for the items you might want to recommend. This is unlike collaborative filtering, which needs a number of people to interact with a product and indicate whether they like it before it can decide whether to recommend that product to a new user.
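A minimal sketch of the content-based idea, assuming users and items have already been turned into feature vectors (the representation and function name are my assumptions):

```python
import numpy as np

def content_score(user_profile: np.ndarray, item_features: np.ndarray) -> float:
    """Cosine similarity between a user's taste profile and an item's
    description-derived features; works even for brand-new items with no
    ratings, which is what sidesteps the cold-start problem."""
    denom = np.linalg.norm(user_profile) * np.linalg.norm(item_features)
    return float(user_profile @ item_features) / denom if denom else 0.0

# A restaurant with zero ratings can still be scored from features such as
# a cuisine one-hot encoding, price level, and menu keywords.
```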
Data iteration
- Error analysis can be harder if there is no good baseline (such as HLP, human-level performance) to compare to.
- Error analysis, user feedback and benchmarking to competitors can all provide inspiration for features to add.
6. Experiment tracking
What to track?
- Algorithm/code versioning
- Dataset used
- Hyperparameters
- Results
Tracking tools
- Text files
- Spreadsheet
- Experiment tracking system
Desirable features
- Information needed to replicate results
- Experiment results, ideally with summary metrics/analysis
- Perhaps also: Resource monitoring, visualization, model error analysis
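A minimal sketch of the low-tech end of this spectrum, appending each run's record to a JSON-lines file (the helper and its field names are my own, not a specific tool's API):

```python
import json
import subprocess
import time
from pathlib import Path

def log_experiment(dataset: str, hyperparams: dict, results: dict,
                   logfile: str = "experiments.jsonl") -> None:
    """Append one experiment record with enough information to replicate it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        # Code version: the current git commit hash.
        "commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "dataset": dataset,          # e.g., a path or dataset version tag
        "hyperparams": hyperparams,  # e.g., {"lr": 1e-3, "epochs": 10}
        "results": results,          # summary metrics, e.g., {"val_acc": 0.94}
    }
    with Path(logfile).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Dedicated experiment tracking systems add the "perhaps also" items above (resource monitoring, visualization, error analysis views) on top of this same core record.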
7. From big data to good data
Try to ensure consistently high-quality data in all phases of the ML project lifecycle.
Good data:
- Covers important cases (good coverage of inputs x)
- Is defined consistently (definition of labels y is unambiguous)
- Has timely feedback from production data (distribution covers data drift and concept drift)
- Is sized appropriately
8. Week 2 Optional References
Week 2: Select and Train Model
If you wish to dive more deeply into the topics covered this week, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Establishing a baseline
Error analysis
Experiment tracking
Papers
Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … Anderljung, M. (n.d.). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. Retrieved May 7, 2021, from http://arxiv.org/abs/2004.07213v2
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. Retrieved from http://arxiv.org/abs/1912.02292