Machine Learning Model Evaluation Metrics
Now let's talk about some of the metrics we can look at to evaluate our models. We'll start with binary classification and then move to regression models.
Binary Classification Evaluation
Confusion Matrix
Let's take the example of binary classification with spam email detection.
We have the true values from our labeled data - whether an email is spam or not spam. Our model makes predictions, and we can compare these predictions to the actual labels.
For example (see the image above):
- First email: correctly classified as spam ✓
- Second email: predicted spam, but actually wasn't spam ✗
- Third email: wrong prediction ✗
- Fourth email: correct prediction ✓
- Fifth email: correct prediction ✓
- Sixth email: wrong prediction ✗
We can compare the true values with what our model predicted and create what's called a confusion matrix.
Confusion Matrix Structure
A confusion matrix compares the predicted value (positive for spam, negative for not spam) to the actual value from our labeled dataset:
- True Positives (top-left): Predicted positive and actual value was positive
- False Negatives (top-right): Predicted not spam, but actually was spam
- False Positives (bottom-left): Predicted spam, but actually wasn't spam
- True Negatives (bottom-right): Predicted not spam and actually was not spam
We want to maximize true positives and true negatives while minimizing false positives and false negatives.
How do we create this matrix?
To create this matrix, we look at our datasets (for example, 10,000 items we trained and predicted on) and count how many fall into each category.
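For instance, here is a minimal sketch (not from the course - the labels are hypothetical, chosen to match the six-email example above) of how a confusion matrix can be computed with scikit-learn:

```python
# Minimal sketch: build a confusion matrix for the six-email example.
from sklearn.metrics import confusion_matrix

# 1 = spam, 0 = not spam (hypothetical labels matching the ✓/✗ pattern above)
y_true = [1, 0, 1, 0, 1, 0]  # actual labels from the dataset
y_pred = [1, 1, 0, 0, 1, 1]  # model predictions (emails 2, 3, and 6 are wrong)

# Rows = actual class, columns = predicted class (order: spam, not spam)
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[2 1]   -> 2 true positives, 1 false negative
#  [2 1]]  -> 2 false positives, 1 true negative
```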
Classification Metrics
From the confusion matrix, we can compute several metrics:
1. Precision
- Formula: True Positives ÷ (True Positives + False Positives)
- Measures: It is called precision because it asks "If we predict positives, how precise are we? How many times are we right about positives versus how many times are we wrong about positives?"
2. Recall
- Formula: True Positives ÷ (True Positives + False Negatives)
- Also known as the True Positive Rate, or Sensitivity
- Measures: "Of all the actual positives, how many did we manage to catch?"
3. F1 Score
- Formula: 2 × (Precision × Recall) ÷ (Precision + Recall)
- Widely used metric for confusion matrix evaluation
4. Accuracy
- Formula: (True Positives + True Negatives) ÷ (True Positives + True Negatives + False Positives + False Negatives)
- Rarely used in practice, because it can be misleading on imbalanced datasets
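To make these formulas concrete, here is a minimal sketch (my own illustration with hypothetical counts, not from the course) that computes all four metrics directly from confusion-matrix counts:

```python
# Hypothetical counts from 1,000 emails
tp, fn, fp, tn = 80, 20, 30, 870

precision = tp / (tp + fp)                        # of predicted spam, how many really were spam
recall    = tp / (tp + fn)                        # of actual spam, how many we caught
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.727 recall=0.800 f1=0.762 accuracy=0.950
```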
You don't need to memorize the exact formulas, and you barely need to remember what each metric means. What you do need to remember is that precision, recall, F1, and accuracy are metrics used to evaluate binary classification, and that is what the exam will test you on.
When to Use Which Metric
The choice of metric depends on what you're looking for:
"Costly" = Bad Consequences of Wrong Predictions. The "cost" isn't about which feature matters most - it's about which type of wrong answer causes more damage.
- Precision: Best when false positives are costly
- Recall: Best when false negatives are costly
- F1 Score: Gives balance between precision and recall, especially useful for imbalanced datasets
- Accuracy: Rarely used, only for balanced datasets
What do we mean by balanced and imbalanced datasets? (See below)
Balanced vs Imbalanced Datasets:
- Balanced dataset: each class has roughly the same number of examples
- Imbalanced dataset: one class heavily outnumbers the other. Note that ==> spam vs not-spam is typically not a balanced dataset, since most emails are not spam (see the sketch below)
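As a quick illustration (my own example, not from the course), here is why accuracy misleads on an imbalanced dataset: a model that never predicts spam still looks 95% accurate if only 5% of the emails are spam, while recall and F1 expose the problem.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [1] * 50 + [0] * 950   # 5% spam, 95% not spam (imbalanced)
y_pred = [0] * 1000             # the model never predicts spam

print("accuracy:", accuracy_score(y_true, y_pred))                  # 0.95 -- looks great
print("recall:  ", recall_score(y_true, y_pred))                    # 0.0  -- catches no spam
print("f1:      ", f1_score(y_true, y_pred, zero_division=0))       # 0.0  -- exposes the problem
```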
For a more detailed explanation, see this link
AUC-ROC
AUC-ROC stands for Area Under the Curve of the Receiver Operating Characteristic (ROC) curve. It's more complicated, but just remember the name for the exam.
- Value ranges from 0 to 1, with 1 being the perfect model
- Plots sensitivity (the true positive rate) against 1 minus specificity (the false positive rate)
The ROC Curve has two axes:
- Vertical axis: How often your model classifies actual spam as spam (sensitivity)
- Horizontal axis: How often your model classifies not-spam as spam (1 - specificity)
About the Curve:
- The plot can show the curves of multiple models; the diagonal straight line represents a random model.
- The more accurate your model, the more the curve leans toward the top-left.
- AUC measures how much area is under the curve.
To draw this curve, you vary the classification threshold of your model, compute a confusion matrix at each threshold, and plot the resulting true positive and false positive rates as points on the curve.
AUC-ROC is very useful when comparing thresholds and choosing the right model for binary classification.
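Here is a minimal sketch (my own illustration with made-up probability scores) of how the curve and the AUC come from the model's predicted probabilities, using scikit-learn's roc_curve and roc_auc_score:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 1, 0, 0]                    # actual labels (1 = spam)
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]   # predicted probability of spam

# roc_curve sweeps thresholds and returns one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC =", auc)   # 0.875 here; 1.0 is a perfect model, 0.5 is a random model
```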
To understand more, use this link
Confusion Matrix can be Multi-Dimensional
- The confusion matrix can also be multi-dimensional.
- That means we can have multiple categories in a classification and still create a confusion matrix, with one row and one column per category (see the sketch below)
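For example, here is a minimal sketch (with hypothetical labels) of a 3-class confusion matrix, where each class gets its own row and column:

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "bird", "dog", "cat", "bird"]
y_pred = ["cat", "dog", "dog",  "dog", "bird", "bird"]

labels = ["cat", "dog", "bird"]
print(confusion_matrix(y_true, y_pred, labels=labels))
# Rows = actual class, columns = predicted class:
# [[1 0 1]
#  [0 2 0]
#  [0 1 1]]
```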
Regression Evaluation
Now let's look at how we evaluate regression models.
Remember, this applies to cases like linear regression, where we have data points and we're trying to find a line that represents these data points.
We measure accuracy by measuring the error: the error is based on the distances between the predicted values and the actual values (see the formulas below for better understanding).
In the plot, the green line is the predicted value and the actual values are the blue dots. Remember ==> Y hat (ŷ) is the predicted value from the model, and Y is the actual value.
Regression Metrics
Just remember the names of these metrics, not necessarily how they work:
1. MAE (Mean Absolute Error)
- Computes the mean of the absolute differences between predicted and actual values
- Sum the absolute differences, then divide by the number of values you have
2. MAPE (Mean Absolute Percentage Error)
- Instead of computing the actual difference in values, computes how far off you are as a percentage
- Same idea as MAE, but averaging percentages: take each difference (same as MAE) and divide it by the actual value y before averaging
3. RMSE (Root Mean Squared Error)
- Squares the differences (which penalizes large errors more), averages them, and then takes the square root
- RMSE is a very common way to evaluate the error of your regression
4. R Squared
- Measures how much of the variance in the target your model explains
- If R squared is close to 1, your predictions are good
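Here is a minimal sketch (my own numbers, not from the course) of how these four metrics can be computed with scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_true = np.array([60.0, 72.0, 85.0, 90.0, 55.0])   # actual values (y)
y_pred = np.array([58.0, 75.0, 80.0, 93.0, 60.0])   # predicted values (y-hat)

mae  = mean_absolute_error(y_true, y_pred)              # average absolute error
mape = mean_absolute_percentage_error(y_true, y_pred)   # average error as a fraction of y
rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # square, average, then square-root
r2   = r2_score(y_true, y_pred)                         # fraction of variance explained

print(f"MAE={mae:.2f}  MAPE={mape:.2%}  RMSE={rmse:.2f}  R2={r2:.3f}")
```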
From an exam perspective, remember that MAE, MAPE, RMSE, and R-squared are metrics used to judge the quality of a regression and to decide whether it is acceptable for us. From a model optimization point of view, we try to minimize the error metrics (MAE, MAPE, RMSE) and push R-squared toward 1, so that we know our model is accurate.
Understanding Regression Metrics with Examples
Let's say you're trying to predict how well students did on a test based on how many hours they studied.
Error Measurement Metrics (MAE, MAPE, RMSE):
- These show how "accurate" the model is
- Example: If your RMSE is 5, your model's predictions are, roughly speaking, about 5 points off from the actual student scores on average
- Very easy to quantify and measure
R Squared:
- Measures explained variance - a bit more difficult to understand
- For Example: R squared of 0.8 means that 80% of changes in test scores can be explained by how much students studied (which was your input feature)
- The remaining 20% is due to other factors like natural ability or luck
- These other factors may not be captured by your model because they're not features in your model
- A very good R squared, close to 1, means you can explain almost all of the target variable's variance with the input features you have (see the sketch below)
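To see what that means in numbers, here is a minimal sketch (hypothetical hours and scores, my own illustration) that computes R squared by hand from the explained and unexplained variance:

```python
import numpy as np

hours  = np.array([1, 2, 3, 4, 5])        # hypothetical hours studied
scores = np.array([52, 60, 63, 74, 79])   # hypothetical test scores (y)

# Fit a simple line y = a*hours + b by least squares, then predict.
a, b = np.polyfit(hours, scores, 1)
pred = a * hours + b                           # y-hat

ss_res = np.sum((scores - pred) ** 2)          # variance the model fails to explain
ss_tot = np.sum((scores - scores.mean()) ** 2) # total variance in the scores
r2 = 1 - ss_res / ss_tot
print(f"R squared = {r2:.2f}")  # close to 1 -> hours studied explains most of the score variance
```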
Key Takeaways
From an exam perspective:
- For Classification: Use metrics from confusion matrix - accuracy, precision, recall, F1, and AUC-ROC
- For Regression: Use MAE, MAPE, RMSE, and R squared for models that predict continuous values
The purpose of a confusion matrix is to evaluate the performance of models that do classifications.
For model optimization, we try to minimize these error metrics to ensure our model is accurate.
You should now understand which metrics are for classification and which are for regression, and have a high-level understanding of what these metrics do.
Sample MCQs for Reference:
Q: A data scientist wants to evaluate a regression model that must heavily penalize large errors. Which metric should they use?
✅ Answer: RMSE
Q: A team wants a regression metric that's easily understandable by a non-technical stakeholder and reports the average error in the same unit as the target variable. Which metric fits best?
✅ Answer: MAE
Q: Which regression metric explains how much of the variability in the data is captured by the model?
✅ Answer: R-squared