F1 Score in Machine Learning

Learn the F1 Score in simple, beginner-friendly language. Understand precision, recall, examples, mistakes, and when to use F1 in machine learning.

Hans Volkers

Dec 9, 2025

Why Accuracy Isn’t Enough (The Truth No One Tells Beginners)

Imagine you build a model that predicts whether a bank transaction is fraudulent.

Out of 10,000 transactions, only 10 are actually fraud.
If your model simply predicts “No fraud” every single time, guess what?

You get 99.9% accuracy.

Amazing, right?
No. That model is completely useless.

This is the moment every beginner realizes:

Accuracy only works when data is balanced.
When classes are uneven, accuracy lies.
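
To make the arithmetic concrete, here is a minimal sketch in plain Python that rebuilds the example above. The label lists are synthetic and exist only for illustration; they are not from a real bank dataset.

```python
# Synthetic labels for the example: 1 = fraud, 0 = legitimate.
y_true = [1] * 10 + [0] * 9_990      # 10 fraud cases in 10,000 transactions
y_pred = [0] * len(y_true)           # a "model" that always says "No fraud"

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many of the real fraud cases were actually flagged?
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.1%}")      # Accuracy: 99.9%
print(f"Fraud caught: {fraud_caught}")  # Fraud caught: 0
```

The accuracy number looks excellent even though not a single fraud case was found.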

This is where the F1 Score enters: the hero metric that saves you from misleading numbers.

The Confusion Matrix (A Simple Table That Explains Everything)

Every classification model’s performance depends on four numbers:

Term	Meaning (Simple Explanation)
TP (True Positive)	Model correctly predicted “Yes”
FP (False Positive)	Model predicted “Yes” but it was “No”
FN (False Negative)	Model predicted “No” but it was “Yes”
TN (True Negative)	Model correctly predicted “No”

Think of a medical test:

TP: Sick person correctly identified
FN: Sick person missed → very dangerous
FP: Healthy person incorrectly told they’re sick → panic, retest
TN: Healthy person correctly identified

Everything we learn next (Precision, Recall, and F1) comes from these four values.
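
As a rough sketch, the four numbers can be counted by hand from a list of true labels and a list of predictions. The function name `confusion_counts` and the sample labels below are made up for illustration, with 1 meaning "Yes" and 0 meaning "No".

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = "Yes" and 0 = "No"."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # correctly predicted "Yes"
        elif predicted == 1 and actual == 0:
            fp += 1          # predicted "Yes" but it was "No"
        elif predicted == 0 and actual == 1:
            fn += 1          # predicted "No" but it was "Yes"
        else:
            tn += 1          # correctly predicted "No"
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```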

Precision and Recall: The Two Most Important Words in ML

Precision → “When I predict Positive, how often am I right?”

Example: A spam filter flags 100 emails as spam.
If 90 are actually spam → High precision.

Recall → “How many actual positives did I catch?”

If 100 spam emails exist and the filter catches only 50 → recall = 0.5.

Both matter but both can fail alone.

You can have:

High precision, low recall
High recall, low precision

F1 Score fixes this imbalance.
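
The two spam-filter examples above can be worked out from the confusion-matrix counts. This is a minimal sketch; the helper names are illustrative, and it assumes the model made at least one positive prediction and that at least one actual positive exists (otherwise the division is undefined).

```python
def precision(tp, fp):
    """Of everything predicted positive, what fraction really was positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, what fraction did the model catch?"""
    return tp / (tp + fn)

# Spam filter flags 100 emails; 90 are really spam, 10 are not.
print(precision(tp=90, fp=10))  # 0.9

# 100 spam emails exist; the filter catches 50 and misses 50.
print(recall(tp=50, fn=50))     # 0.5
```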

What Is the F1 Score? (Zero Jargon)

The F1 Score is designed to solve a problem that accuracy can never fix:

a model can be “accurate” but still completely useless on imbalanced data.

Precision and recall each tell only half the story:

High recall, low precision → You catch many positives but also make many wrong predictions.
High precision, low recall → You avoid mistakes but miss many real positives.

Both situations are bad in real-world machine learning.

This is why the F1 Score combines precision and recall into ONE balanced number.
It forces your model to be both:

✔ good at catching positives
✔ good at being correct when predicting positives

In simple words:

F1 Score tells you how well your model really performs when accuracy fails—especially when your dataset is imbalanced.

The F1 Formula (Explained Without Fear)

The F1 Score is calculated as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It uses the harmonic mean, not the normal average, for a very important reason:

The harmonic mean punishes imbalance.

If precision is high but recall is low, or the other way around, the F1 Score drops sharply.

Example:

Precision = 1.0
Recall = 0

Even though precision is perfect, recall is zero so:

F1 Score = 0

This tells you instantly:

“The model is not actually performing well.”

The harmonic mean forces both precision and recall to be good at the same time.

✔ If one value collapses, F1 collapses.

✔ If one value is weak, F1 exposes it.

This is why the F1 Score is much more honest than accuracy, especially in imbalanced datasets where accuracy can easily look high while the model performs poorly.
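
A short sketch comparing the plain average with the harmonic mean shows the "punishment" in action. The precision and recall pairs are arbitrary illustrative values, and returning 0 when both inputs are 0 is a convention assumed here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(0.9, 0.9), (0.9, 0.5), (1.0, 0.1), (1.0, 0.0)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  average={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

With precision 1.0 and recall 0.1, the plain average is still 0.55, while F1 falls to about 0.18.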

Why F1 Score Is Better Than Accuracy (The Real Reason)

Accuracy looks at only one thing:
How many predictions were correct out of the total?

It does not care about:

False positives
False negatives
Class imbalance
The cost of mistakes

This is why accuracy can look “perfect” even when a model is performing terribly in real life.

Example (Fraud Detection):

There are 100 fraud cases in a dataset of 10,000 transactions.
If a model predicts “No fraud” for every transaction, here’s what happens:

It correctly predicts 9,900 normal transactions → high accuracy
It completely misses all 100 fraud cases

Accuracy = 99%
Performance = 0% useful

Accuracy makes this model look amazing.
But in reality, it’s a complete failure.

This is where F1 Score exposes the truth.

Since the model caught 0 out of 100 fraud cases:

Precision = 0 (strictly 0/0, because the model made no positive predictions; treated as 0 by convention)
Recall = 0
F1 Score = 0

And that’s the correct judgment.

In simple words:

Accuracy lies when data is imbalanced.
F1 Score tells the real truth about your model’s performance.
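
Here is the same fraud example as a small sketch. Because the model never predicts fraud, precision is a 0/0 division, so the hypothetical helper `safe_div` returns 0 in that case; that is a common convention, assumed here rather than taken from the article.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is 0 (a common convention)."""
    return numerator / denominator if denominator else 0.0

# "Always predict No fraud" on 10,000 transactions with 100 fraud cases.
tp, fp, fn, tn = 0, 0, 100, 9_900

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)   # 0/0: no positive predictions at all
recall = safe_div(tp, tp + fn)      # 0/100: every fraud case was missed
f1 = safe_div(2 * precision * recall, precision + recall)

print(accuracy, precision, recall, f1)  # 0.99 0.0 0.0 0.0
```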

When You Should NOT Use F1 Score (Beginners Never Learn This)

1. When True Negatives Matter

F1 Score completely ignores TN, even though TN can be extremely important.
In systems like spam detection, millions of legitimate emails must be classified correctly.
F1 does not reward you for getting them right, meaning it cannot measure overall stability.

2. When Precision or Recall Is More Important Than Balance

Some problems care more about not missing positives → Recall-heavy (e.g., medical tests).
Others care more about avoiding false alarms → Precision-heavy (e.g., credit approval).
F1 treats both as equally important, which may hide real performance differences.
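
One common way to express such a preference is the F-beta score, a generalization of F1 that this article does not otherwise cover, so treat the following as an illustrative sketch only. With beta = 1 it is exactly F1; beta > 1 leans toward recall and beta < 1 leans toward precision. The precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

p, r = 0.5, 0.9  # illustrative values: modest precision, high recall
print(round(f_beta(p, r, beta=1.0), 3))  # 0.643 (plain F1)
print(round(f_beta(p, r, beta=2.0), 3))  # 0.776 (recall-heavy)
print(round(f_beta(p, r, beta=0.5), 3))  # 0.549 (precision-heavy)
```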

3. When Data Is Extremely Imbalanced

In anomaly detection or rare-event prediction, F1 may become unstable.
Better metrics include:

PR-AUC (Precision–Recall Area Under Curve)
MCC (Matthews Correlation Coefficient)
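ROC-AUC (Area Under the Receiver Operating Characteristic Curve)

PR-AUC and MCC in particular tend to hold up better under heavy imbalance; ROC-AUC can still look optimistic when negatives vastly outnumber positives.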

4. When Business Cost Matters

F1 assumes false positives and false negatives have equal importance.
But in the real world, the cost is rarely equal.
Example:

A false positive costs ₹1,000
A false negative costs ₹10,00,000

F1 treats both errors the same, which does not match business reality.
Different tasks require different priorities, and F1 cannot express that.
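
To see the mismatch, here is a small sketch with two hypothetical models evaluated on the same 100 positive cases. The counts are invented so that both models land on the same F1; the costs are the ones from the example above.

```python
COST_FP = 1_000        # ₹1,000 per false positive
COST_FN = 1_000_000    # ₹10,00,000 per false negative

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

def total_cost(fp, fn):
    """Total cost of all mistakes under the assumed per-error costs."""
    return fp * COST_FP + fn * COST_FN

# Two hypothetical models; each sees 100 actual positives (TP + FN = 100).
models = {"A": dict(tp=80, fp=20, fn=20), "B": dict(tp=90, fp=35, fn=10)}
for name, c in models.items():
    print(name, f1_from_counts(**c), total_cost(c["fp"], c["fn"]))
```

Both models score an F1 of 0.8, yet under these assumed costs model A's mistakes cost roughly twice as much as model B's.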

F1 Score Ignores True Negatives: Why This Is a Big Deal

TN (True Negative) means:

“The model correctly said NO.”

In many real-world systems, correctly predicting “No” is just as important as predicting “Yes.”

✔ Where TN matters a lot:

Spam filters → Millions of normal emails must be recognized correctly
Intrusion detection → Most network activity is safe
Review moderation → Most comments are not abusive
Sentiment analysis → Most statements are neutral or normal

These systems process huge volumes of “negative” cases, so getting TN right is crucial for stability and user trust.

The Problem

F1 Score ignores TN completely.
It only looks at TP, FP, and FN.

Because of this, two models can have:

very different TN
very different stability
very different user impact

…but still end up with the same F1 Score.

This can make F1 Score misleading for large-scale classification tasks where the majority of data is negative.

The Better Alternatives

Metrics that do consider TN often perform better here:

MCC (Matthews Correlation Coefficient)
ROC-AUC
Balanced Accuracy

These give a more realistic picture when TN plays a major role.
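
The sketch below illustrates the blind spot with two made-up confusion matrices that share the same TP, FP, and FN but differ in TN. F1 cannot tell them apart, while MCC (computed here from its standard formula) changes.

```python
from math import sqrt

def f1_from_counts(tp, fp, fn):
    """F1 from counts; note that TN does not appear anywhere."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; uses all four confusion-matrix cells."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Same TP, FP, FN; only the number of true negatives differs (made-up counts).
small = dict(tp=80, fp=20, fn=20, tn=80)
large = dict(tp=80, fp=20, fn=20, tn=9_880)

for name, c in [("small", small), ("large", large)]:
    f1 = f1_from_counts(c["tp"], c["fp"], c["fn"])
    print(name, round(f1, 3), round(mcc(**c), 3))
# small 0.8 0.6
# large 0.8 0.798
```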

Types of F1 Scores (Macro, Micro, Weighted)

When you move from binary classification to multi-class classification, the F1 Score becomes more interesting because you now have multiple classes and each class may have a different number of samples.
To handle this fairly, we use three variants of F1:

1. Macro F1

Macro F1 calculates the F1 Score for each class separately, then takes the average.

Every class gets equal importance
Even a rare class with only 5 samples counts as much as a class with 5,000 samples
Great when you want fairness across all classes, including the rare ones

Use Macro F1 when:
→ You want to treat every class equally, regardless of size.

2. Micro F1

Micro F1 aggregates all TP, FP, and FN across all classes before calculating F1.

Counts each individual prediction equally
Larger classes naturally influence the score more
Best for multi-label problems, where multiple labels can be true at once

Use Micro F1 when:
→ You want a global measure of performance that reflects overall prediction accuracy.

3. Weighted F1

Weighted F1 is similar to Macro F1, but each class’s F1 Score is weighted by how many samples it has.

Larger classes get more weight
Prevents small classes from dominating the macro average
Best for imbalanced multi-class datasets, where some classes appear rarely

Use Weighted F1 when:
→ You want a fair metric but still want the score to reflect class distribution.
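
The three averages can be sketched by hand from per-class counts. The counts below are hypothetical (a 3-class problem with 100 samples, where class C is rare and never predicted correctly), and the helper name is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from counts; taken as 0 when there is nothing to count."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical per-class counts. Support = true samples of a class = TP + FN.
counts = {
    "A": dict(tp=85, fp=4, fn=5),   # support 90
    "B": dict(tp=5, fp=5, fn=3),    # support 8
    "C": dict(tp=0, fp=1, fn=2),    # support 2
}

per_class = {k: f1_from_counts(**c) for k, c in counts.items()}
support = {k: c["tp"] + c["fn"] for k, c in counts.items()}

# Macro: plain average of the per-class scores.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool the counts across classes first, then compute one F1.
micro = f1_from_counts(
    sum(c["tp"] for c in counts.values()),
    sum(c["fp"] for c in counts.values()),
    sum(c["fn"] for c in counts.values()),
)

# Weighted: average of per-class scores, weighted by support.
weighted = sum(per_class[k] * support[k] for k in counts) / sum(support.values())

print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.502 0.9 0.899
```

The failed rare class drags Macro F1 down to about 0.5, while Micro and Weighted F1 stay near 0.9 because class A dominates the sample count.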

F1 vs Threshold (Why 0.5 is Usually Wrong)

A classification model doesn’t directly output “Yes” or “No.”
It outputs a probability, for example:

0.92 chance of fraud
0.37 chance of spam
0.18 chance of positive sentiment

We then choose a threshold (usually 0.5) to convert that probability into a prediction.

✔ Predicted “Yes” if probability ≥ threshold

✔ Predicted “No” if probability < threshold

But here’s the important part:

Changing the threshold changes your model’s behavior.

When you lower the threshold:

Model predicts “Yes” more often
Recall increases (you catch more positives)
Precision decreases (more false positives)

When you raise the threshold:

Model predicts “Yes” less often
Precision increases (fewer false alarms)
Recall decreases (you miss more positives)

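Because the F1 Score depends on both precision and recall,
the highest F1 Score often appears at a threshold other than the default 0.5. On imbalanced data it is frequently lower (for example, somewhere between 0.2 and 0.4), but the exact value depends on the model and the data.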

This is why many beginners struggle:

They leave the threshold at 0.5 and assume the model is bad…
when the real issue is simply a poorly tuned threshold.
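
A threshold sweep can be sketched in a few lines. The labels and scores below are made-up toy values standing in for a model's predicted probabilities on validation data; in practice the threshold should be chosen on a validation set, not on the test set.

```python
def f1_at(threshold, y_true, scores):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up validation data: 7 negatives followed by 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.33, 0.45, 0.30, 0.42, 0.80]

thresholds = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
best = max(thresholds, key=lambda t: f1_at(t, y_true, scores))

print("F1 at 0.5:", f1_at(0.5, y_true, scores))              # F1 at 0.5: 0.5
print("Best threshold:", best, f1_at(best, y_true, scores))  # Best threshold: 0.3 0.75
```

On this toy data the default 0.5 catches only one of the three positives, while 0.3 catches all three at the price of two false alarms.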

Real-World Mini Case Studies (Simple + Relatable)

Case 1: Medical Diagnosis (Recall matters most)

In healthcare, missing a positive case is the worst possible mistake.
If a model fails to detect a disease (false negative), the consequences can be deadly.

High recall = fewer missed patients
Precision is still important, but secondary
F1 Score helps measure overall balance, but recall is the real priority

Why?
A false alarm can be retested.
A missed case might not get a second chance.

Case 2: Fraud Detection

Banks want to catch fraud, but they must avoid blocking real customers.
A false positive here means a genuine user gets flagged, causing frustration and financial loss.

High precision = fewer innocent customers blocked
Recall still matters, but cannot come at the cost of too many false alarms
F1 Score gives a balanced view of both sides

Why?
A customer falsely flagged for fraud may lose trust immediately.

Case 3: Spam Classification

Most emails are legitimate (true negatives).
A good spam filter must correctly identify millions of “normal” emails every day.

But:

F1 Score ignores true negatives
So it cannot fully represent the performance of a spam classifier
Metrics like Balanced Accuracy, MCC, or ROC-AUC offer a more complete evaluation

Why?
If F1 is high but TN is poor, legitimate emails will keep landing in the spam folder, a very bad user experience.

Python Code Example (Beginner Friendly)

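from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 95% class 0 and 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95], n_classes=2, random_state=42)

# Hold out 20% for testing; stratify keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train the classifier and predict labels for the held-out data
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Scores for the positive (minority) class
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred))

# Per-class breakdown plus the macro and weighted averages
print(classification_report(y_test, pred))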

What the output means:

Precision tells how accurate positive predictions are
Recall tells how many positives we caught
F1 gives a balanced score
classification_report shows per-class precision, recall, and F1, plus accuracy and the macro and weighted averages

Common Beginner Mistakes

1. Relying Only on Accuracy

Accuracy looks good even when the model is failing, especially with imbalanced datasets.
If you trust accuracy alone, you will almost always overestimate your model’s performance.

2. Using F1 Without Checking the Confusion Matrix

F1 is just a summary.
Without the confusion matrix, you cannot see where the model is going wrong: FP, FN, TP, and TN all matter.
The confusion matrix always tells the real story.

3. Not Tuning the Threshold

Most models default to a 0.5 threshold, but this rarely gives the best F1 Score.
If you never tune the threshold, you’re not seeing your model’s true potential.

4. Using Macro F1 for Imbalanced Data

Macro F1 treats all classes equally, even rare ones.
On heavily imbalanced datasets, a few samples in a rare class can swing the score and give a misleading picture, so check it alongside the per-class scores.

5. Comparing F1 Scores Across Different Datasets

An F1 Score of 0.70 on one dataset may be excellent, but terrible on another.
F1 is not an absolute number; it depends on the dataset and the problem.

6. Ignoring Business Cost

F1 assumes false positives and false negatives are equally bad.
In reality, one mistake may cost ₹1,000 while the other may cost ₹10,00,000.
Business context matters more than the metric.

How to Improve F1 Score (Actionable Tips)

Improving the F1 Score isn’t just about changing models; it’s about improving how the model learns and predicts. Small adjustments can make your precision and recall work together more effectively.

✔ Tune the Classification Threshold

Even a small shift in the decision threshold can dramatically change precision, recall, and ultimately the F1 Score.
On imbalanced data, F1 often improves once you move away from the default 0.5.

✔ Resample the Dataset

Class imbalance is the enemy of a good F1 Score.
To fix this, you can use:

SMOTE (synthetic minority oversampling)
Undersampling the majority class
Class weights in your model

These methods help the model understand minority classes better.
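
As a minimal sketch of the resampling idea, the snippet below uses plain random oversampling (duplicating minority samples), which is a simpler stand-in for SMOTE and not SMOTE itself. The function name and toy data are hypothetical. Resampling should be applied to the training split only, so the test set keeps the real class ratio.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until every class is the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())            # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        X_out.extend(rng.choices(pool, k=target - n))   # sample with replacement
        y_out.extend([label] * (target - n))
    return X_out, y_out

# Toy training set: 95 negatives, 5 positives (made-up data).
X_train = [[i] for i in range(100)]
y_train = [0] * 95 + [1] * 5

X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_train))  # Counter({0: 95, 1: 5})
print(Counter(y_bal))    # Counter({0: 95, 1: 95})
```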

✔ Improve Feature Engineering

Better features create clearer separation between classes.
This reduces confusion and directly boosts both precision and recall, leading to a higher F1 Score.

✔ Tune Hyperparameters

Using techniques like grid search or random search can significantly improve model performance.
Well-tuned models make fewer mistakes and achieve a more balanced F1 Score.

✔ Collect More Data (Especially Minority Class)

If the minority class has too few examples, the model cannot learn its patterns properly.
Adding more samples, even 20% more, can lead to a noticeable improvement in F1.

What Is a “Good” F1 Score?

There is no single F1 Score that counts as “good” for every situation.
Different fields have different levels of acceptable risk, error tolerance, and data quality, so the meaning of a “good” F1 Score changes from one problem to another.

Medical Field

In healthcare, mistakes can be life-threatening.
Models must be extremely reliable, so an F1 Score above 0.90 is generally expected.

Fraud Detection

Fraud cases are rare, noisy, and difficult to predict perfectly.
Typical real-world systems achieve an F1 Score between 0.70 and 0.85.

NLP / Sentiment Analysis

Clean, well-labeled text is often easier for models to learn patterns from.
Good models often reach 0.80+ F1 Scores on such datasets.

Real-World Messy Data

When the dataset is full of noise, missing values, imbalance, or imperfect labels, even an F1 Score of 0.60 may be a meaningful win.

Final Summary

Accuracy is not enough
F1 Score balances precision & recall
F1 is well suited to imbalanced datasets
But less useful when TN matters
Thresholds affect F1 heavily
Macro/micro/weighted F1 help in multi-class models
Other metrics sometimes outperform F1
Beginners often misuse F1
You now understand F1 better than most beginners

If you're learning machine learning, the F1 Score will be one of your best friends, especially when accuracy tries to fool you.

Hans Volkers, a managing director with 40 years of experience, is highly respected for his expertise and leadership. Throughout his career, he has effectively applied data-driven strategies to drive organizational success. His deep commitment to ethical practices and his authoritative knowledge have made him a trusted leader, perfectly embodying the principles of expertise, authoritativeness, and trustworthiness.