Coursera - Machine Learning in Production - Week 2 - Section 2 - Error analysis and performance auditing
January 3, 2025
Week 2: Modeling Challenges and Strategies
Section 2: Error analysis and performance auditing
1. Error analysis example
Speech recognition example
Useful metrics for each tag
- What fraction of errors has that tag?
- Of all data with that tag, what fraction is misclassified?
- What fraction of all the data has that tag?
- How much room for improvement is there on data with that tag?
One way to estimate the room for improvement, which you have already seen, is to measure human-level performance (HLP) on the data with that tag.
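A minimal sketch of how these per-tag fractions might be computed; the `tags`/`is_error` record layout is a hypothetical assumption for illustration, not something prescribed by the course:

```python
from collections import defaultdict

def per_tag_metrics(examples):
    """Per-tag error-analysis fractions described above.

    `examples` is assumed to be a list of dicts like
    {"tags": ["car_noise"], "is_error": True}  (hypothetical layout).
    """
    n_total = len(examples)
    n_errors = sum(e["is_error"] for e in examples)
    tag_total = defaultdict(int)    # examples carrying each tag
    tag_errors = defaultdict(int)   # misclassified examples carrying each tag

    for e in examples:
        for tag in e["tags"]:
            tag_total[tag] += 1
            tag_errors[tag] += int(e["is_error"])

    report = {}
    for tag in tag_total:
        report[tag] = {
            "fraction_of_errors_with_tag": tag_errors[tag] / max(n_errors, 1),
            "error_rate_on_tag": tag_errors[tag] / tag_total[tag],
            "fraction_of_data_with_tag": tag_total[tag] / n_total,
        }
    # Room for improvement on each tag would additionally require measuring
    # human-level performance (HLP) on the data with that tag.
    return report
```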
2. Prioritizing what to work on
| Type | Accuracy | Human level performance | Gap to HLP | % of data | Gap × % of data |
| --- | --- | --- | --- | --- | --- |
| Clean Speech | 94% | 95% | 1% | 60% | 0.6% |
| Car Noise | 89% | 93% | 4% | 4% | 0.16% |
| People Noise | 87% | 89% | 2% | 30% | 0.6% |
| Low Bandwidth | 70% | 70% | 0% | 6% | ~0% |
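The last column is just the gap to HLP multiplied by the fraction of data in that category, i.e., the most that overall accuracy could improve by closing that gap. A minimal sketch of that calculation (numbers copied from the table; this ranking only captures room for improvement and frequency, not ease or importance):

```python
# Estimate how much overall accuracy could improve by closing the
# gap to human-level performance (HLP) in each category.
categories = {
    # name: (model accuracy, human-level performance, fraction of data)
    "Clean Speech":  (0.94, 0.95, 0.60),
    "Car Noise":     (0.89, 0.93, 0.04),
    "People Noise":  (0.87, 0.89, 0.30),
    "Low Bandwidth": (0.70, 0.70, 0.06),
}

potential = {
    name: (hlp - acc) * frac
    for name, (acc, hlp, frac) in categories.items()
}

# Rank categories by potential overall improvement (largest first).
for name, gain in sorted(potential.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} potential gain: {gain:.2%}")
```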
Decide on the most important categories to work on based on:
- How much room for improvement there is.
- How frequently that category appears.
- How easy it is to improve accuracy in that category.
- How important it is to improve in that category.
3. Skewed datasets
Confusion matrix: Precision and Recall
What happens with print("0")? On a heavily skewed dataset (the vast majority of examples are y = 0), a program that simply prints 0 for every input achieves very high accuracy while learning nothing; its recall on the positive class is 0, which the confusion matrix makes obvious. This is why accuracy alone is misleading on skewed data.
Combining precision and recall - F1 score
F1 = 2 / (1/P + 1/R) = 2PR / (P + R), the harmonic mean of precision and recall, so a very low value of either metric drags F1 down (see Model 2 below).
| Model | Precision (P) | Recall (R) | F1 |
| --- | --- | --- | --- |
| Model 1 | 88.3% | 79.1% | 83.4% |
| Model 2 | 97.0% | 7.3% | 13.6% |
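A minimal sketch (binary labels, with 1 as the rare positive class; the 99%/1% split is made up for illustration) showing how precision, recall, and F1 expose the weakness of the print("0") baseline:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # TP / (TP + FN)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)           # harmonic mean
    return precision, recall, f1

# Skewed toy dataset: 99% negatives, 1% positives (hypothetical numbers).
y_true = [0] * 990 + [1] * 10
always_zero = [0] * 1000          # the print("0") baseline

accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(accuracy)                                 # 0.99 -- looks great
print(precision_recall_f1(y_true, always_zero)) # (0.0, 0.0, 0.0) -- useless
```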
Multi-class metrics
Example: a multi-class defect-detection problem with classes Scratch, Dent, Pit mark, and Discoloration.
| Defect Type | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Scratch | 82.1% | 99.2% | 89.8% |
| Dent | 92.1% | 99.5% | 95.7% |
| Pit mark | 85.3% | 98.7% | 91.5% |
| Discoloration | 72.1% | 97.0% | 82.7% |
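One way to get a per-class table like the one above is scikit-learn's classification_report (the course does not prescribe a library; the labels below are made-up examples):

```python
from sklearn.metrics import classification_report

classes = ["Scratch", "Dent", "Pit mark", "Discoloration"]

# Hypothetical ground-truth and predicted labels, for illustration only.
y_true = ["Scratch", "Dent", "Pit mark", "Discoloration", "Scratch", "Dent"]
y_pred = ["Scratch", "Dent", "Pit mark", "Scratch",       "Scratch", "Dent"]

# Per-class precision, recall, and F1 in one call; each defect type is
# treated as the positive class in turn (one-vs-rest).
print(classification_report(y_true, y_pred, labels=classes, zero_division=0))
```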
4. Performance auditing
Auditing framework
Check for accuracy, fairness/bias, and other problems.
1. Brainstorm the ways the system might go wrong.
- Performance on subsets of data (e.g., ethnicity, gender).
- How common certain types of errors are (e.g., false positives (FP) and false negatives (FN)).
- Performance on rare classes.
2. Establish metrics to assess performance against these issues on appropriate slices of data.
3. Get business/product owner buy-in.
Speech recognition example
1. Brainstorm the ways the system might go wrong.
- Accuracy on different genders and ethnicities.
- Accuracy on different devices.
- Prevalence of rude mis-transcriptions.
2. Establish metrics to assess performance against these issues:
- Mean accuracy for different genders and major accents.
- Mean accuracy on different devices.
- Prevalence of offensive words in the output.
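A small sketch of computing mean accuracy on slices of the evaluation set, as in the metrics above (the `device`/`accent`/`correct` field names are hypothetical):

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """Mean accuracy per slice (e.g., device type or accent).

    `examples` is assumed to be a list of dicts with a `correct` boolean
    and metadata fields such as "device" or "accent" (hypothetical layout).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for e in examples:
        group = e[slice_key]
        totals[group] += 1
        correct[group] += int(e["correct"])
    return {g: correct[g] / totals[g] for g in totals}

# Example usage with made-up data:
eval_set = [
    {"device": "phone", "accent": "A", "correct": True},
    {"device": "phone", "accent": "B", "correct": False},
    {"device": "laptop", "accent": "A", "correct": True},
]
print(accuracy_by_slice(eval_set, "device"))   # {'phone': 0.5, 'laptop': 1.0}
print(accuracy_by_slice(eval_set, "accent"))   # {'A': 1.0, 'B': 0.0}
```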