Coursera - Machine Learning in Production - Week 3 - Section 1 - Define Data and Establish Baseline

January 5, 2025


Week 3: Data Definition and Baseline


Section 1: Define Data and Establish Baseline


1. Why is data definition hard?


2. More label ambiguity examples


3. Major types of data problems


Major types of data problems
The course lays these out as a 2×2 grid: unstructured vs. structured data on one axis, small vs. big data on the other.

  • Small data (≤10,000), unstructured: manufacturing visual inspection from 100 training examples.
  • Small data (≤10,000), structured: housing price prediction based on square footage, etc., from 50 training examples.
  • Big data (>10,000), unstructured: speech recognition from 50 million training examples.
  • Big data (>10,000), structured: online shopping recommendations for 1 million users.

Axis annotations:
  • Small data (≤10,000): clean labels are critical.
  • Big data (>10,000): emphasis on data process.
  • Unstructured: humans can label data; data augmentation.
  • Structured: harder to obtain more data.

Unstructured vs. structured data

Unstructured data
  • May or may not have a huge collection of unlabeled examples x.
  • Humans can label more data.
  • Data augmentation is more likely to be helpful (see the sketch after this list).
Structured data
  • May be more difficult to obtain more data.
  • Human labeling may not be possible (with some exceptions).
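
To make the augmentation bullet concrete, here is a minimal sketch for unstructured audio data; the 1-D waveform layout, noise scale, and gain range are illustrative assumptions, not from the course:

```python
import numpy as np

def augment_audio(waveform, rng, noise_scale=0.005):
    """Create a new training example by adding low-level noise and a random gain."""
    noisy = waveform + rng.normal(0.0, noise_scale, size=waveform.shape)
    gain = rng.uniform(0.8, 1.2)               # small random volume change
    return np.clip(noisy * gain, -1.0, 1.0)

rng = np.random.default_rng(0)
clip = rng.uniform(-0.5, 0.5, size=16000)      # stand-in for 1 s of 16 kHz audio
extra = [augment_audio(clip, rng) for _ in range(5)]  # 5 new examples from 1
```

This is why augmentation favors unstructured data: a noisy copy of an audio clip is still a valid example of the same label, whereas perturbing a structured field such as square footage would change the true target.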

Small data (≤10,000) vs. big data (>10,000)

Small data
  • Clean labels are critical.
  • Can manually look through dataset and fix labels.
  • Can get all the labelers to talk to each other.
Big data
  • Emphasis on data process.

4. Small data and label consistency


Big data problems can have small data challenges too

Problems with a large dataset overall can still have small data challenges when there is a long tail of rare events in the input: each rare event has only a handful of examples (a toy illustration follows the list).
  • Web search
  • Self-driving cars
  • Product recommendation systems
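
A toy illustration of why the long tail creates small data problems (all counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical query log: a few head queries dominate, while most
# distinct queries occur only once or twice.
queries = ["weather"] * 500 + ["news"] * 300 + [f"rare query {i}" for i in range(200)]

counts = Counter(queries)
tail = [q for q, c in counts.items() if c <= 2]
print(f"{len(tail)} of {len(counts)} distinct queries are rare; "
      "each one is effectively a small data problem with few examples.")
```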

5. Improving label consistency


Improving label consistency

  • Have multiple labelers label the same example (see the sketch below for flagging disagreements).
  • When there is disagreement, have the machine learning engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
  • If labelers believe that x doesn't contain enough information, consider changing x.
  • Iterate until it is hard to significantly increase agreement.
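
A minimal sketch of the first step: flag the examples where labelers disagree so the team can discuss them. The data layout (a dict from example ID to one label per labeler) is an assumption for illustration:

```python
from collections import Counter

def find_disagreements(labels_by_example):
    """Return (example_id, agreement, counts) for non-unanimous examples,
    most contested first."""
    flagged = []
    for example_id, labels in labels_by_example.items():
        counts = Counter(labels)
        top_votes = counts.most_common(1)[0][1]
        agreement = top_votes / len(labels)
        if agreement < 1.0:
            flagged.append((example_id, agreement, dict(counts)))
    return sorted(flagged, key=lambda item: item[1])

# Three labelers label phone images for scratches.
labels = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "no scratch", "scratch"],
    "img_003": ["no scratch", "scratch", "no scratch"],
}
for example_id, agreement, counts in find_disagreements(labels):
    print(example_id, f"agreement={agreement:.2f}", counts)
```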

Small data vs. big data (unstructured data)

Small data
  • Usually small number of labelers.
  • Can ask labelers to discuss specific labels.
Big data
  • Get to a consistent definition with a small group.
  • Then send labeling instructions to labelers.
  • Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy (a minimal voting sketch follows).
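
A minimal majority-vote sketch; the agreement threshold is a tunable assumption, and ties fall through to None so they can be routed back for review:

```python
from collections import Counter

def consensus_label(labels, min_agreement=0.5):
    """Majority vote over one example's labels; None if no label clears the bar."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(labels) > min_agreement else None

print(consensus_label(["scratch", "scratch", "no scratch"]))  # -> scratch
print(consensus_label(["scratch", "no scratch"]))             # -> None (tie)
```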

6. Human level performance (HLP)


Why measure HLP?

Estimate Bayes error / irreducible error to help with error analysis and prioritization.
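
Measuring HLP is straightforward once you have human labels and externally defined ground truth; a minimal sketch, with the parallel-list data layout assumed:

```python
def human_level_performance(human_labels, ground_truth):
    """Fraction of examples where the human labeler matches ground truth."""
    assert len(human_labels) == len(ground_truth)
    matches = sum(h == g for h, g in zip(human_labels, ground_truth))
    return matches / len(ground_truth)

hlp = human_level_performance(["cat", "dog", "cat"], ["cat", "dog", "dog"])
print(f"HLP = {hlp:.2f}")  # 1 - HLP is a rough estimate of irreducible error
```

If the model scores 0.60 and HLP is 0.67, the 7-point gap suggests how much headroom error analysis can still realistically buy.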

Other uses of HLP

  • In academia, establish and beat a respectable benchmark to support publication.
  • Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
  • "Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it.

The problem with beating Human Level Performance as proof of machine learning superiority is multifold: when ground truth is itself just another human label, inconsistent labeling conventions can make measured HLP artificially low, so a model can "beat" HLP without actually being more useful.
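
A worked example of this effect, with an assumed 70/30 split between two equally valid labeling conventions for ambiguous inputs:

```python
# Labelers are assumed to pick convention A 70% of the time, B 30%.
p_a, p_b = 0.70, 0.30

# Measured HLP: the probability that two independent labelers agree.
hlp = p_a**2 + p_b**2        # 0.58

# A model that always outputs convention A matches the human
# "ground truth" label whenever that labeler happened to use A.
model_agreement = p_a        # 0.70

print(f"HLP = {hlp:.2f}, model = {model_agreement:.2f}")
# The model "beats HLP" by 12 points just by being consistent,
# not by producing more useful output.
```

The same arithmetic previews Section 7: if labeling instructions were fixed so everyone used one convention, HLP would rise toward 1.0 and the model's spurious edge would disappear.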

7. Raising HLP


Raising HLP

When the ground truth label is externally defined (e.g., biopsy), HLP gives an estimate for Bayes error / irreducible error.

But often ground truth is just another human label.

Raising HLP

  • When the label y comes from another human label, HLP ≪ 100% may indicate ambiguous labeling instructions.
  • Improving label consistency will raise HLP.
  • This makes it harder for ML to beat HLP, but the more consistent labels will also raise ML performance, which is ultimately likely to benefit the actual application.

HLP on structured data

Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.

Some exceptions:
  • User ID merging: Same person?
  • Based on network traffic, is the computer hacked?
  • Is the transaction fraudulent?
  • Spam account? Bot?
  • From GPS, what is the mode of transportation - on foot, bike, car, bus?