Coursera - Machine Learning in Production - Week 3 - Section 1 - Define Data and Establish Baseline
2025-01-05
Week 3: Data Definition and Baseline
Section 1: Define Data and Establish Baseline
1. Why is data definition hard?
2. More label ambiguity examples
3. Major types of data problems
Major types of data problems
| | Unstructured | Structured | Notes |
|---|---|---|---|
| Small data (≤10,000) | Manufacturing visual inspection from 100 training examples | Housing price prediction based on square footage, etc. from 50 training examples | Clean labels are critical. |
| Big data (>10,000) | Speech recognition from 50 million training examples | Online shopping recommendations for 1 million users | Emphasis on data process. |
| Notes | Humans can label data. Data augmentation. | Harder to obtain more data. | |
Unstructured vs. structured data
Unstructured data
- May or may not have a huge collection of unlabeled examples x.
- Humans can label more data.
- Data augmentation is more likely to be helpful.
Structured data
- May be more difficult to obtain more data.
- Human labeling may not be possible (with some exceptions).
Small data (≤10,000) vs. big data (>10,000)
Small data
- Clean labels are critical.
- Can manually look through the dataset and fix labels.
- Can get all the labelers to talk to each other.
Big data
- Emphasis on data process.
4. Small data and label consistency
Big data problems can have small data challenges too
A problem with a large dataset still faces small data challenges when its inputs have a long tail of rare events.
- Web search
- Self-driving cars
- Product recommendation systems
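One way to see the long tail in practice is to count examples per slice of the input and flag the slices that are effectively small data. The sketch below is illustrative only: the `rare_slices` helper, the slice names, and the `min_examples` threshold of 100 are all assumptions, not part of the course material.

```python
from collections import Counter

def rare_slices(slice_labels, min_examples=100):
    """Return the slices that have fewer than min_examples examples.

    slice_labels holds one slice name per example (e.g. a query type).
    The threshold is an arbitrary illustrative choice.
    """
    counts = Counter(slice_labels)
    return {name: n for name, n in counts.items() if n < min_examples}

# Hypothetical dataset: large overall, but two slices are rare.
slices = ["common"] * 50_000 + ["rare_a"] * 40 + ["rare_b"] * 7
print(rare_slices(slices))
```

The flagged slices are the ones where clean, consistent labels matter most, because each mislabeled example is a large share of that slice.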
5. Improving label consistency
Improving label consistency
- Have multiple labelers label the same example.
- When there is disagreement, have the machine learning engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
- If labelers believe that x doesn't contain enough information, consider changing x.
- Iterate until it is hard to significantly increase agreement.
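To know whether agreement is still increasing between iterations, it helps to put a number on it. The following is a minimal sketch, assuming each example has been labeled by several labelers. The function names and the sample labels are hypothetical, and simple pairwise agreement is only one possible measure.

```python
from itertools import combinations

def agreement_rate(labels_per_example):
    """Fraction of labeler pairs that agree, pooled over all examples."""
    agree = total = 0
    for labels in labels_per_example:
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total if total else float("nan")

def disputed_examples(labels_per_example):
    """Indices of examples where the labelers did not all agree."""
    return [i for i, labels in enumerate(labels_per_example)
            if len(set(labels)) > 1]

# Hypothetical labels from three labelers on four examples.
labels = [
    ["defect", "defect", "defect"],
    ["defect", "ok", "defect"],
    ["ok", "ok", "ok"],
    ["ok", "defect", "defect"],
]
print(f"agreement: {agreement_rate(labels):.2f}")
print("discuss these examples:", disputed_examples(labels))
```

The disputed examples are natural candidates for the discussion step, and the agreement rate can be recomputed after each revision of the labeling instructions.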
Small data vs. big data (unstructured data)
Small data
- Usually a small number of labelers.
- Can ask labelers to discuss specific labels.
Big data
- Get to a consistent definition with a small group.
- Then send labeling instructions to labelers.
- Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy.
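Voting over multiple labels per example can be sketched as a simple majority vote. This is one possible scheme, not necessarily the one the course has in mind. The `consensus_label` helper is hypothetical, and returning `None` on a tie (so the example can be sent back for review) is an assumed design choice.

```python
from collections import Counter

def consensus_label(labels):
    """Return the majority label, or None if there is no label or a tie."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no consensus, flag for review
    return ranked[0][0]

# Hypothetical votes from three labelers per example.
votes = [["cat", "cat", "dog"], ["dog", "cat"], ["dog", "dog", "dog"]]
print([consensus_label(v) for v in votes])
```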
6. Human level performance (HLP)
Why measure HLP?
Estimate Bayes error / irreducible error to help with error analysis and prioritization.
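As a rough illustration of how HLP can feed prioritization, the sketch below treats HLP as a ceiling for each category and ranks categories by the gap to HLP weighted by their share of the data. The category names, the numbers, and the `potential_gain` helper are all hypothetical.

```python
def potential_gain(categories):
    """Rank categories by (HLP - model accuracy) * share of data.

    categories maps name -> (model_accuracy, hlp, share_of_data).
    Gaps below zero are clipped, since HLP is used as the ceiling here.
    """
    gains = {
        name: max(hlp - acc, 0.0) * share
        for name, (acc, hlp, share) in categories.items()
    }
    return dict(sorted(gains.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical per-category numbers.
categories = {
    "car_noise": (0.89, 0.93, 0.04),
    "clean": (0.94, 0.95, 0.60),
    "low_bandwidth": (0.70, 0.70, 0.06),
}
for name, gain in potential_gain(categories).items():
    print(f"{name}: {gain:.4f}")
```

A category with low accuracy may still rank last if humans do no better on it, which is the sense in which HLP approximates irreducible error.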
Other uses of HLP
- In academia, establish and beat a respectable benchmark to support publication.
- Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
- "Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it.
The problem with beating human-level performance as proof of machine learning superiority is multifold.
7. Raising HLP
Raising HLP
When the ground truth label is externally defined (e.g., biopsy), HLP gives an estimate for Bayes error / irreducible error.
But often ground truth is just another human label.
Raising HLP
- When the label y comes from a human label, HLP ≪ 100% may indicate ambiguous labeling instructions.
- Improving label consistency will raise HLP.
- This makes it harder for ML to beat HLP. But the more consistent labels will raise ML performance, which is ultimately likely to benefit the actual application performance.
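A small worked calculation shows why ambiguous instructions depress HLP. Assume, purely for illustration, that two labeling conventions are both acceptable and each labeler independently picks the first with probability p. HLP measured against another human's label is then just the chance that two labelers agree, while a model that always outputs the more common convention can score higher without being any more useful.

```python
def labeler_agreement(p):
    """Chance that two independent labelers pick the same of two conventions."""
    return p * p + (1 - p) * (1 - p)

def majority_model_agreement(p):
    """Agreement with one labeler for a model that always picks the more common convention."""
    return max(p, 1 - p)

# Hypothetical: 70% of labelers use the first convention.
p = 0.7
print(f"measured HLP: {labeler_agreement(p):.2f}")
print(f"always-majority model: {majority_model_agreement(p):.2f}")

# With one agreed convention (p = 1.0) both reach 100%.
print(f"after standardizing: {labeler_agreement(1.0):.2f}")
```

Once the labelers standardize on one convention, the measured HLP rises and the model's apparent advantage disappears, which matches the points above.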
HLP on structured data
Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.
Some exceptions:
- User ID merging: Same person?
- Based on network traffic, is the computer hacked?
- Is the transaction fraudulent?
- Spam account? Bot?
- From GPS, what is the mode of transportation - on foot, bike, car, bus?