Coursera - Machine Learning in Production - Week 3 - Section 1 - Define Data and Establish Baseline

January 5, 2025


Week 3: Data Definition and Baseline


Section 1: Define Data and Establish Baseline


1. Why is data definition hard?


2. More label ambiguity examples


3. Major types of data problems


Major types of data problems
The course lays these out as a 2×2 grid: unstructured vs. structured data on one axis, small vs. big data on the other.

  • Small data (≤10,000), unstructured: manufacturing visual inspection from 100 training examples.
  • Small data (≤10,000), structured: housing price prediction based on square footage, etc., from 50 training examples.
  • Big data (>10,000), unstructured: speech recognition from 50 million training examples.
  • Big data (>10,000), structured: online shopping recommendations for 1 million users.

Axis annotations:
  • Small data (≤10,000): clean labels are critical.
  • Big data (>10,000): emphasis on data process.
  • Unstructured: humans can label data; data augmentation.
  • Structured: harder to obtain more data.

Unstructured vs. structured data

Unstructured data
  • May or may not have a huge collection of unlabeled examples x.
  • Humans can label more data.
  • Data augmentation is more likely to be helpful (see the sketch after this list).
Structured data
  • May be more difficult to obtain more data.
  • Human labeling may not be possible (with some exceptions).
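
To make the augmentation bullet concrete, here is a minimal sketch for unstructured audio data; the 1-D waveform layout, noise scale, and gain range are illustrative assumptions, not from the course:

```python
import numpy as np

def augment_audio(waveform, rng, noise_scale=0.005):
    """Create a new training example by adding low-level noise and a random gain."""
    noisy = waveform + rng.normal(0.0, noise_scale, size=waveform.shape)
    gain = rng.uniform(0.8, 1.2)               # small random volume change
    return np.clip(noisy * gain, -1.0, 1.0)

rng = np.random.default_rng(0)
clip = rng.uniform(-0.5, 0.5, size=16000)      # stand-in for 1 s of 16 kHz audio
extra = [augment_audio(clip, rng) for _ in range(5)]  # 5 new examples from 1
```

This is why augmentation favors unstructured data: a noisy copy of an audio clip is still a valid example of the same label, whereas perturbing a structured field such as square footage would change the true target.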

Small data (≤10,000) vs. big data (>10,000)

Small data
  • Clean labels are critical.
  • Can manually look through dataset and fix labels.
  • Can get all the labelers to talk to each other.
Big data
  • Emphasis on data process.

4. Small data and label consistency


Big data problems can have small data challenges too

Problems with a large dataset overall can still have small data challenges when there is a long tail of rare events in the input: each rare event has only a handful of examples (a toy illustration follows the list).
  • Web search
  • Self-driving cars
  • Product recommendation systems
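
A toy illustration of why the long tail creates small data problems (all counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical query log: a few head queries dominate, while most
# distinct queries occur only once or twice.
queries = ["weather"] * 500 + ["news"] * 300 + [f"rare query {i}" for i in range(200)]

counts = Counter(queries)
tail = [q for q, c in counts.items() if c <= 2]
print(f"{len(tail)} of {len(counts)} distinct queries are rare; "
      "each one is effectively a small data problem with few examples.")
```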

5. Improving label consistency


Improving label consistency

  • Have multiple labelers label the same example (see the sketch below for flagging disagreements).
  • When there is disagreement, have the machine learning engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
  • If labelers believe that x doesn't contain enough information, consider changing x.
  • Iterate until it is hard to significantly increase agreement.
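
A minimal sketch of the first step: flag the examples where labelers disagree so the team can discuss them. The data layout (a dict from example ID to one label per labeler) is an assumption for illustration:

```python
from collections import Counter

def find_disagreements(labels_by_example):
    """Return (example_id, agreement, counts) for non-unanimous examples,
    most contested first."""
    flagged = []
    for example_id, labels in labels_by_example.items():
        counts = Counter(labels)
        top_votes = counts.most_common(1)[0][1]
        agreement = top_votes / len(labels)
        if agreement < 1.0:
            flagged.append((example_id, agreement, dict(counts)))
    return sorted(flagged, key=lambda item: item[1])

# Three labelers label phone images for scratches.
labels = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "no scratch", "scratch"],
    "img_003": ["no scratch", "scratch", "no scratch"],
}
for example_id, agreement, counts in find_disagreements(labels):
    print(example_id, f"agreement={agreement:.2f}", counts)
```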

Small data vs. big data (unstructured data)

Small data
  • Usually small number of labelers.
  • Can ask labelers to discuss specific labels.
Big data
  • Get to a consistent definition with a small group.
  • Then send labeling instructions to labelers.
  • Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy (a minimal voting sketch follows).
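
A minimal majority-vote sketch; the agreement threshold is a tunable assumption, and ties fall through to None so they can be routed back for review:

```python
from collections import Counter

def consensus_label(labels, min_agreement=0.5):
    """Majority vote over one example's labels; None if no label clears the bar."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(labels) > min_agreement else None

print(consensus_label(["scratch", "scratch", "no scratch"]))  # -> scratch
print(consensus_label(["scratch", "no scratch"]))             # -> None (tie)
```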

6. Human level performance (HLP)


Why measure HLP?

Estimate Bayes error / irreducible error to help with error analysis and prioritization.
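
Measuring HLP is straightforward once you have human labels and externally defined ground truth; a minimal sketch, with the parallel-list data layout assumed:

```python
def human_level_performance(human_labels, ground_truth):
    """Fraction of examples where the human labeler matches ground truth."""
    assert len(human_labels) == len(ground_truth)
    matches = sum(h == g for h, g in zip(human_labels, ground_truth))
    return matches / len(ground_truth)

hlp = human_level_performance(["cat", "dog", "cat"], ["cat", "dog", "dog"])
print(f"HLP = {hlp:.2f}")  # 1 - HLP is a rough estimate of irreducible error
```

If the model scores 0.60 and HLP is 0.67, the 7-point gap suggests how much headroom error analysis can still realistically buy.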

Other uses of HLP

  • In academia, establish and beat a respectable benchmark to support publication.
  • Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
  • "Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it.

The problem with beating Human Level Performance as proof of machine learning superiority is multifold: when ground truth is itself just another human label, inconsistent labeling conventions can make measured HLP artificially low, so a model can "beat" HLP without actually being more useful.
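
A worked example of this effect, with an assumed 70/30 split between two equally valid labeling conventions for ambiguous inputs:

```python
# Labelers are assumed to pick convention A 70% of the time, B 30%.
p_a, p_b = 0.70, 0.30

# Measured HLP: the probability that two independent labelers agree.
hlp = p_a**2 + p_b**2        # 0.58

# A model that always outputs convention A matches the human
# "ground truth" label whenever that labeler happened to use A.
model_agreement = p_a        # 0.70

print(f"HLP = {hlp:.2f}, model = {model_agreement:.2f}")
# The model "beats HLP" by 12 points just by being consistent,
# not by producing more useful output.
```

The same arithmetic previews Section 7: if labeling instructions were fixed so everyone used one convention, HLP would rise toward 1.0 and the model's spurious edge would disappear.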

7. Raising HLP


Raising HLP

When the ground truth label is externally defined (e.g., biopsy), HLP gives an estimate for Bayes error / irreducible error.

But often ground truth is just another human label.

Raising HLP

  • When the label y comes from another human label, HLP ≪ 100% may indicate ambiguous labeling instructions.
  • Improving label consistency will raise HLP.
  • This makes it harder for ML to beat HLP, but the more consistent labels will also raise ML performance, which is ultimately likely to benefit the actual application.

HLP on structured data

Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.

Some exceptions:
  • User ID merging: Same person?
  • Based on network traffic, is the computer hacked?
  • Is the transaction fraudulent?
  • Spam account? Bot?
  • From GPS, what is the mode of transportation - on foot, bike, car, bus?