Coursera - Machine Learning in Production - Week 2 - Section 3 - Data iteration
January 4, 2025
Week 2: Modeling Challenges and Strategies
Section 3: Data iteration
1. Data-centric AI development
Model-centric view
Take the data you have, and develop a model that does as well as possible on it.
Hold the data fixed and iteratively improve the code/model.
Data-centric view
The quality of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.
Hold the code fixed and iteratively improve the data.
How can you make your data set even better? One of the most important ways to improve the quality of a data set is data augmentation.
2. A useful picture of data augmentation
Picture the model's performance across the input space as a rubber sheet. It turns out that for unstructured data problems, pulling up one piece of this rubber sheet (improving performance on one category of input, say via data augmentation) is unlikely to cause a different piece of the sheet to dip down far below where it was. Instead, pulling up one point pulls nearby points up quite a lot, and far-away points may be pulled up a little bit, or, if you're lucky, maybe more than a little bit.
3. Data augmentation
Goal:
Create realistic examples that (i) the algorithm does poorly on, but (ii) humans (or another baseline) do well on. For speech, a common tactic is mixing background noise into clean clips (see the sketch after the checklist).
Checklist:
- Does it sound realistic?
- Is the x -> y mapping clear? (e.g., can humans recognize speech?)
- Is the algorithm currently doing poorly on it?
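A minimal sketch of the speech case, assuming clips are NumPy float arrays at the same sample rate (the function name and default SNR are illustrative choices, not from the course):

```python
import numpy as np

def add_background_noise(speech: np.ndarray, noise: np.ndarray,
                         snr_db: float = 10.0) -> np.ndarray:
    """Mix a background-noise clip into a speech clip at a target
    signal-to-noise ratio (in dB)."""
    noise = noise[: len(speech)]                 # trim noise to speech length
    speech_power = float(np.mean(speech ** 2))
    noise_power = float(np.mean(noise ** 2)) + 1e-12
    # Scale the noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` downward makes examples harder; the checklist above is the guardrail that keeps them realistic and recognizable by humans.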
The rubber sheet analogy
Image example
You can take the image and flip it horizontally. This results in a pretty realistic image: the phone buttons are now on the other side, but it could still be a useful example to add to your training set. You could also apply contrast changes, brightening the image so the scratch becomes more visible, or try darkening it.
I've also used more advanced techniques, such as GANs (Generative Adversarial Networks), to synthesize scratches like these automatically, although I've found that such techniques can be overkill: there are simpler techniques that are much faster to implement and work just fine, without the complexity of building a GAN to synthesize scratches.
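A minimal sketch of those simpler techniques using Pillow (my choice of library, and the enhancement factors are illustrative; the course doesn't prescribe either):

```python
from PIL import Image, ImageEnhance

def simple_augmentations(img: Image.Image) -> list[Image.Image]:
    """Cheap, realistic variants of a phone-inspection image."""
    return [
        # Horizontal flip: the buttons end up on the other side,
        # but the image is still realistic.
        img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),
        # Raise the contrast so a faint scratch stands out more.
        ImageEnhance.Contrast(img).enhance(1.5),
        # Brighten or darken to mimic different lighting conditions.
        ImageEnhance.Brightness(img).enhance(1.3),
        ImageEnhance.Brightness(img).enhance(0.7),
    ]
```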
Data iteration loop: add/improve data -> train the model -> error analysis -> repeat.
4. Can adding data hurt performance?
For unstructured data problems, adding data rarely hurts accuracy if:
- The model is large (low bias).
- The mapping x -> y is clear (e.g., given only the input x, humans can make accurate predictions).
5. Adding features
For structured data problems, you usually have a fixed set of users, restaurants, or products, which makes it hard to use data augmentation or to collect new data from users you don't have yet or restaurants that may or may not exist. Instead, adding features can be a more fruitful way to improve the algorithm's performance and fix problems like this one, identified through error analysis (see the sketch after the examples below).
Other food delivery examples
- Users who only order tea/coffee
- Users who only order pizza
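A minimal sketch of what added features might look like, loosely following the lecture's vegetarian example (all field names here are my own illustration, not the course's schema):

```python
from dataclasses import dataclass

@dataclass
class UserFeatures:
    avg_order_price: float
    orders_per_week: float
    only_orders_drinks: bool  # flags the "only tea/coffee" segment
    is_vegetarian: bool       # self-reported or inferred from order history

@dataclass
class RestaurantFeatures:
    avg_menu_price: float
    has_veg_options: bool

def feature_vector(u: UserFeatures, r: RestaurantFeatures) -> list[float]:
    """Concatenate user, restaurant, and cross features for a ranking model."""
    return [
        u.avg_order_price,
        u.orders_per_week,
        float(u.only_orders_drinks),
        r.avg_menu_price,
        float(r.has_veg_options),
        # Cross feature suggested by error analysis: vegetarian users matched
        # to restaurants without vegetarian options are bad recommendations.
        float(u.is_vegetarian and not r.has_veg_options),
    ]
```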
Product recommendation:
Collaborative filtering -> Content-based filtering
Over the last several years, there has been a trend in product recommendations: a shift from collaborative filtering approaches to what are called content-based filtering approaches.
Collaborative filtering is, loosely, an approach that looks at the user, tries to figure out who is similar to that user, and then recommends things that people like you also liked.
In contrast, a content-based filtering approach tends to look at you as a person and at the description of the restaurant, its menu, and other information about it, to decide whether that restaurant is a good match for you.
The advantage of content-based filtering is that even for a new restaurant or product that hardly anyone else has liked yet, you can more quickly make good recommendations by looking at the item's description rather than only at who else liked it. Recommending brand-new items like this is sometimes called the cold-start problem.
How do you recommend a brand-new product that almost no one has purchased, liked, or disliked so far? One way is to make sure you capture good features for the items you might want to recommend. This is unlike collaborative filtering, which needs a number of people to interact with a product and indicate whether they like it before it can decide whether to recommend that product to a new user.
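A minimal sketch of the content-based idea, assuming users and items have already been turned into feature vectors (the representation and function name are my assumptions):

```python
import numpy as np

def content_score(user_profile: np.ndarray, item_features: np.ndarray) -> float:
    """Cosine similarity between a user's taste profile and an item's
    description-derived features; works even for brand-new items with no
    ratings, which is what sidesteps the cold-start problem."""
    denom = np.linalg.norm(user_profile) * np.linalg.norm(item_features)
    return float(user_profile @ item_features) / denom if denom else 0.0

# A restaurant with zero ratings can still be scored from features such as
# a cuisine one-hot encoding, price level, and menu keywords.
```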
Data iteration
- Error analysis can be harder if there is no good baseline (such as HLP, human-level performance) to compare to.
- Error analysis, user feedback and benchmarking to competitors can all provide inspiration for features to add.
6. Experiment tracking
What to track?
- Algorithm/code versioning
- Dataset used
- Hyperparameters
- Results
Tracking tools
- Text files
- Spreadsheet
- Experiment tracking system
Desirable features
- Information needed to replicate results
- Experiment results, ideally with summary metrics/analysis
- Perhaps also: Resource monitoring, visualization, model error analysis
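A minimal sketch of the low-tech end of this spectrum, appending each run's record to a JSON-lines file (the helper and its field names are my own, not a specific tool's API):

```python
import json
import subprocess
import time
from pathlib import Path

def log_experiment(dataset: str, hyperparams: dict, results: dict,
                   logfile: str = "experiments.jsonl") -> None:
    """Append one experiment record with enough information to replicate it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        # Code version: the current git commit hash.
        "commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "dataset": dataset,          # e.g., a path or dataset version tag
        "hyperparams": hyperparams,  # e.g., {"lr": 1e-3, "epochs": 10}
        "results": results,          # summary metrics, e.g., {"val_acc": 0.94}
    }
    with Path(logfile).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Dedicated experiment tracking systems add the "perhaps also" items above (resource monitoring, visualization, error analysis views) on top of this same core record.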
7. From big data to good data
Try to ensure consistently high-quality data in all phases of the ML project lifecycle.
Good data:
- Covers important cases (good coverage of inputs x)
- Is defined consistently (definition of labels y is unambiguous)
- Has timely feedback from production data (distribution covers data drift and concept drift)
- Is sized appropriately
8. Week 2 Optional References
Week 2: Select and Train Model
If you wish to dive more deeply into the topics covered this week, feel free to check out these optional references. You won’t have to read these to complete this week’s practice quizzes.
Establishing a baseline
Error analysis
Experiment tracking
Papers
Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … Anderljung, M. (n.d.). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. Retrieved May 7, 2021, from http://arxiv.org/abs/2004.07213v2
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. Retrieved from http://arxiv.org/abs/1912.02292