The F1 score is a key metric used to evaluate the accuracy of a machine learning model. It helps us understand how well our models are performing, especially when dealing with imbalanced datasets.
At SmarterX, we use F1 scores to validate our models. It is a critical part of our process before we bring a model or CPG decision to market.
The F1 score is a measure of a model's accuracy that considers both precision and recall. It ranges from 0 to 1, with 1 being the best possible score.
Specifically, the F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both concerns, ensuring the model identifies relevant items without generating too many false alarms.
Making accurate predictions is crucial. An F1 score gives us a reliable measure of our model's performance – and it removes subjectivity. For example, in regulatory compliance, we need to ensure that our models correctly identify products that fall under certain regulations without missing any or incorrectly flagging non-relevant products.
To calculate the F1 score, we first need to understand two concepts:

- **Precision**: of all the items the model flagged as positive, the fraction that are actually positive. It is computed from true positives (TP) and false positives (FP).
- **Recall**: of all the items that are actually positive, the fraction the model correctly flagged. It is computed from true positives (TP) and false negatives (FN).
The formulas are as follows:
Precision:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall:
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1 Score:
$$\text{F1 Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)$$
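The three formulas above can be sketched as a small Python function (a minimal illustration; the function and argument names are our own, not part of any particular library):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute the F1 score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)  # fraction of flagged items that are truly positive
    recall = tp / (tp + fn)     # fraction of actual positives the model caught
    # F1 is the harmonic mean of precision and recall
    return 2 * (precision * recall) / (precision + recall)
```

Note that this sketch assumes at least one item was flagged and at least one positive exists; production code should guard against division by zero.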
Let’s say we have a model that identifies defective products on a production line. Out of 100 products, the model produces:

- 70 true positives (defective products correctly flagged)
- 10 false positives (good products incorrectly flagged as defective)
- 20 false negatives (defective products the model missed)
We calculate:
Precision:
$$\text{Precision} = \frac{70}{70 + 10} = \frac{70}{80} = 0.875$$
Recall:
$$\text{Recall} = \frac{70}{70 + 20} = \frac{70}{90} \approx 0.778$$
F1 Score:
$$\text{F1 Score} = 2 \times \left( \frac{0.875 \times 0.778}{0.875 + 0.778} \right) \approx 0.824$$
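The arithmetic above can be checked with a few lines of Python (variable names are illustrative):

```python
tp, fp, fn = 70, 10, 20  # counts from the production-line example

precision = tp / (tp + fp)  # 70 / 80 = 0.875
recall = tp / (tp + fn)     # 70 / 90 ≈ 0.778
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.3f}")  # 0.875
print(f"Recall:    {recall:.3f}")     # 0.778
print(f"F1 score:  {f1:.3f}")         # 0.824
```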
This score tells us that our model is fairly accurate in identifying defective products, balancing both precision and recall. What counts as an acceptable F1 score depends on you and your business; at a 0.82, we would keep refining the model.