F1 Score in Machine Learning
Learn the F1 Score in simple, beginner-friendly language. Understand precision, recall, examples, mistakes, and when to use F1 in machine learning.
Why Accuracy Isn’t Enough (The Truth No One Tells Beginners)
Imagine you build a model that predicts whether a bank transaction is fraudulent.
Out of 10,000 transactions, only 10 are actually fraud.
If your model simply predicts “No fraud” every single time, guess what?
You get 99.9% accuracy.
Amazing, right?
No. That model is completely useless.
This is the moment every beginner realizes:
- Accuracy only works when data is balanced.
- When classes are uneven, accuracy lies.
This is where the F1 Score comes in: the hero metric that saves you from misleading numbers.
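The fraud example above can be checked in a few lines. This is a minimal sketch in plain Python; the variable names are illustrative and the "model" is simply a list of zeros.

```python
# 10,000 transactions, only 10 of them fraud (1 = fraud, 0 = normal)
y_true = [1] * 10 + [0] * 9990

# A "model" that always answers "No fraud"
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.1%}")      # 99.9%
print(f"Fraud caught: {fraud_caught}")  # 0
```

The accuracy is near perfect, yet not a single fraud case is caught.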
The Confusion Matrix (A Simple Table That Explains Everything)
Every classification model’s performance depends on four numbers:
| Term | Meaning (Simple Explanation) |
|---|---|
| TP (True Positive) | Model correctly predicted “Yes” |
| FP (False Positive) | Model predicted “Yes” but it was “No” |
| FN (False Negative) | Model predicted “No” but it was “Yes” |
| TN (True Negative) | Model correctly predicted “No” |
Think of a medical test:
- TP: Sick person correctly identified
- FN: Sick person missed → very dangerous
- FP: Healthy person incorrectly told they’re sick → panic, retest
- TN: Healthy person correctly identified
Everything we learn next (Precision, Recall, and F1) comes from these four values.
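To see the four values on real labels, here is a small sketch using scikit-learn's `confusion_matrix`. The labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = "Yes" (sick), 0 = "No" (healthy)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=4
```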
Precision and Recall: The Two Most Important Words in ML
Precision → “When I predict Positive, how often am I right?”
Example: A spam filter flags 100 emails as spam.
If 90 are actually spam → precision = 0.9 (high precision).
Recall → “How many actual positives did I catch?”
If 100 spam emails exist and the filter catches only 50 → recall = 0.5.
Both matter, but each can fail on its own.
You can have:
- High precision, low recall
- High recall, low precision
F1 Score fixes this imbalance.
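The two spam examples above can be written out directly from the confusion-matrix counts. This is a minimal sketch; the helper names `precision` and `recall` are chosen here for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged positive, the share that really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all real positives, the share the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# The filter flags 100 emails and 90 of them are really spam
p = precision(tp=90, fp=10)
# 100 spam emails exist and the filter catches only 50
r = recall(tp=50, fn=50)
print(p, r)  # 0.9 0.5
```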
What Is the F1 Score? (Zero Jargon)
The F1 Score is designed to solve a problem that accuracy can never fix:
a model can be “accurate” but still completely useless on imbalanced data.
Precision and recall each tell only half the story:
- High recall, low precision → You catch many positives but also make many wrong predictions.
- High precision, low recall → You avoid mistakes but miss many real positives.
Both situations are bad in real-world machine learning.
This is why the F1 Score combines precision and recall into ONE balanced number.
It forces your model to be both:
✔ good at catching positives
✔ good at being correct when predicting positives
In simple words:
F1 Score tells you how well your model really performs when accuracy fails—especially when your dataset is imbalanced.
The F1 Formula (Explained Without Fear)
The F1 Score is calculated as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
It uses the harmonic mean, not the normal average, for a very important reason:
The harmonic mean punishes imbalance.
If precision is high but recall is low, or the other way around, the F1 Score drops sharply.
Example:
- Precision = 1.0
- Recall = 0.01
Even though precision is perfect, recall is close to zero, so:
F1 Score ≈ 0.02
A plain average would report about 0.5 for the same model. F1 tells you instantly:
“The model is not actually performing well.”
The harmonic mean forces both precision and recall to be good at the same time.
✔ If one value collapses, F1 collapses.
✔ If one value is weak, F1 exposes it.
This is why the F1 Score is much more honest than accuracy, especially in imbalanced datasets where accuracy can easily look high while the model performs poorly.
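A short sketch makes the difference between the plain average and the harmonic mean concrete. The function name `f1` and the precision/recall pairs are chosen only for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs chosen only for illustration
for p, r in [(0.9, 0.9), (0.9, 0.5), (1.0, 0.01)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.2f} R={r:.2f}  average={arithmetic:.3f}  F1={f1(p, r):.3f}")
```

When the two values are equal, both means agree. The further apart they drift, the more F1 falls below the plain average.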
Why F1 Score Is Better Than Accuracy (The Real Reason)
Accuracy looks at only one thing:
How many predictions were correct out of the total?
It does not care about:
- False positives
- False negatives
- Class imbalance
- The cost of mistakes
This is why accuracy can look “perfect” even when a model is performing terribly in real life.
Example (Fraud Detection):
There are 100 fraud cases in a dataset of 10,000 transactions.
If a model predicts “No fraud” for every transaction, here’s what happens:
- It correctly predicts 9,900 normal transactions → high accuracy
- It completely misses all 100 fraud cases
Accuracy = 99%
Performance = 0% useful
Accuracy makes this model look amazing.
But in reality, it’s a complete failure.
This is where F1 Score exposes the truth.
Since the model caught 0 out of 100 fraud cases:
- Precision = 0 (the model never predicts fraud, so precision is 0/0 and is treated as 0 by convention)
- Recall = 0
- F1 Score = 0
And that’s the correct judgment.
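The same fraud example can be reproduced with scikit-learn. This is a sketch on hand-built labels; `zero_division=0` is passed so that the undefined 0/0 case is reported as 0 without a warning.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 10,000 transactions with 100 fraud cases (1 = fraud, 0 = normal)
y_true = [1] * 100 + [0] * 9900
y_pred = [0] * 10000  # always predicts "No fraud"

acc = accuracy_score(y_true, y_pred)
# zero_division=0 makes the undefined 0/0 cases count as 0
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("Accuracy:", acc)
print("Precision:", prec, "Recall:", rec, "F1:", f1)
```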
In simple words:
Accuracy lies when data is imbalanced.
F1 Score tells the real truth about your model’s performance.
When You Should NOT Use F1 Score (Beginners Never Learn This)
1. When True Negatives Matter
F1 Score completely ignores TN, even though TN can be extremely important.
In systems like spam detection, millions of legitimate emails must be classified correctly.
F1 does not reward you for getting them right, so it cannot tell you how well the model handles the negative class.
2. When Precision or Recall Is More Important Than Balance
Some problems care more about not missing positives → Recall-heavy (e.g., medical tests).
Others care more about avoiding false alarms → Precision-heavy (e.g., credit approval).
F1 treats both as equally important, which may hide real performance differences.
3. When Data Is Extremely Imbalanced
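In anomaly detection or rare-event prediction, F1 may become unstable: with only a handful of positive samples, a few changed predictions can swing the score.
Metrics that are often more informative here include:
- PR-AUC (Area Under the Precision–Recall Curve)
- MCC (Matthews Correlation Coefficient)
- ROC-AUC (Area Under the Receiver Operating Characteristic Curve), though it can look optimistic when positives are very rare
These often capture performance on imbalanced data more reliably.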
4. When Business Cost Matters
F1 assumes false positives and false negatives have equal importance.
But in the real world, the cost is rarely equal.
Example:
- A false positive costs ₹1,000
- A false negative costs ₹10,00,000
F1 treats both errors the same, which does not match business reality.
Different tasks require different priorities, and F1 cannot express that.
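To make the cost argument concrete, here is a sketch with two hypothetical models whose error counts are simply swapped. The counts are invented for illustration; the costs are the ones from the example above.

```python
COST_FP = 1_000       # rupees per false positive
COST_FN = 1_000_000   # rupees per false negative (₹10,00,000)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


def total_cost(fp: int, fn: int) -> int:
    """Total money lost to both kinds of mistakes."""
    return fp * COST_FP + fn * COST_FN


# Two hypothetical models with the same TP but swapped error counts
model_a = {"tp": 80, "fp": 20, "fn": 10}
model_b = {"tp": 80, "fp": 10, "fn": 20}

for name, m in [("A", model_a), ("B", model_b)]:
    print(name, "F1:", round(f1_from_counts(**m), 3), "cost:", total_cost(m["fp"], m["fn"]))
```

Both models get the same F1, but model B costs roughly twice as much because it makes more of the expensive mistake.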
F1 Score Ignores True Negatives: Why This Is a Big Deal
TN (True Negative) means:
“The model correctly said NO.”
In many real-world systems, correctly predicting “No” is just as important as predicting “Yes.”
✔ Where TN matters a lot:
- Spam filters → Millions of normal emails must be recognized correctly
- Intrusion detection → Most network activity is safe
- Review moderation → Most comments are not abusive
- Sentiment analysis → Most statements are neutral or normal
These systems process huge volumes of “negative” cases, so getting TN right is crucial for stability and user trust.
The Problem
F1 Score ignores TN completely.
It only looks at TP, FP, and FN.
Because of this, two models can have:
- very different TN
- very different stability
- very different user impact
…but still end up with the same F1 Score.
This can make F1 Score misleading for large-scale classification tasks where the majority of data is negative.
The Better Alternatives
Metrics that do consider TN often perform better here:
- MCC (Matthews Correlation Coefficient)
- ROC-AUC
- Balanced Accuracy
These give a more realistic picture when TN plays a major role.
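The sketch below builds two hypothetical models with identical TP, FP, and FN but very different TN, then scores them with scikit-learn. The helper `labels_from_counts` and all counts are invented for illustration.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef


def labels_from_counts(tp, fp, fn, tn):
    """Build y_true / y_pred lists that produce the given confusion matrix."""
    y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
    y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn
    return y_true, y_pred


# Hypothetical models: identical TP, FP, FN; only TN differs
for name, tn in [("small TN", 50), ("large TN", 5000)]:
    y_true, y_pred = labels_from_counts(tp=40, fp=30, fn=20, tn=tn)
    print(
        name,
        "F1:", round(f1_score(y_true, y_pred), 3),
        "MCC:", round(matthews_corrcoef(y_true, y_pred), 3),
        "Balanced accuracy:", round(balanced_accuracy_score(y_true, y_pred), 3),
    )
```

F1 is the same for both rows, while MCC and Balanced Accuracy move with TN.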
Types of F1 Scores (Macro, Micro, Weighted)
When you move from binary classification to multi-class classification, the F1 Score becomes more interesting because you now have multiple classes and each class may have a different number of samples.
To handle this fairly, we use three variants of F1:
1. Macro F1
Macro F1 calculates the F1 Score for each class separately, then takes the average.
- Every class gets equal importance
- Even a rare class with only 5 samples counts as much as a class with 5,000 samples
- Great when your dataset is balanced and you want fairness across all classes
Use Macro F1 when:
→ You want to treat every class equally, regardless of size.
2. Micro F1
Micro F1 aggregates all TP, FP, and FN across all classes before calculating F1.
- Counts each individual prediction equally
- Larger classes naturally influence the score more
- Best for multi-label problems, where multiple labels can be true at once
Use Micro F1 when:
→ You want a global measure of performance that reflects overall prediction accuracy.
3. Weighted F1
Weighted F1 is similar to Macro F1, but each class’s F1 Score is weighted by how many samples it has.
- Larger classes get more weight
- Prevents small classes from dominating the macro average
- Best for imbalanced multi-class datasets, where some classes appear rarely
Use Weighted F1 when:
→ You want a fair metric but still want the score to reflect class distribution.
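All three variants are available through the `average` parameter of scikit-learn's `f1_score`. The sketch below uses made-up labels for three classes, one of them rare.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up 3-class labels: class 0 is common, class 2 is rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# average=None returns one F1 per class
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
acc = accuracy_score(y_true, y_pred)

print("Per class:", per_class.round(3))
print("Macro:", round(macro, 3))
print("Micro:", round(micro, 3), "(accuracy:", acc, ")")
print("Weighted:", round(weighted, 3))
```

The rare class is never predicted, so its F1 is 0. That pulls Macro F1 down much more than Weighted F1. In single-label problems like this one, Micro F1 equals accuracy.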
F1 vs Threshold (Why 0.5 is Usually Wrong)
A classification model doesn’t directly output “Yes” or “No.”
It outputs a probability, for example:
- 0.92 chance of fraud
- 0.37 chance of spam
- 0.18 chance of positive sentiment
We then choose a threshold (usually 0.5) to convert that probability into a prediction.
✔ Predicted “Yes” if probability ≥ threshold
✔ Predicted “No” if probability < threshold
But here’s the important part:
Changing the threshold changes your model’s behavior.
When you lower the threshold:
- Model predicts “Yes” more often
- Recall increases (you catch more positives)
- Precision usually decreases (more false positives)
When you raise the threshold:
- Model predicts “Yes” less often
- Precision usually increases (fewer false alarms)
- Recall decreases (you miss more positives)
Because the F1 Score depends on both precision and recall, the best threshold is the one that balances them.
On imbalanced data the highest F1 Score often appears below the default 0.5, but the exact value depends on the model and the dataset.
This is why many beginners struggle:
They leave the threshold at 0.5 and assume the model is bad…
when the real issue is simply a poorly tuned threshold.
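Here is one way to search for a better threshold: a minimal sketch on synthetic data. The dataset settings, the choice of `LogisticRegression`, and the threshold grid are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives); all settings are illustrative
X, y = make_classification(n_samples=4000, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

# Try thresholds 0.05, 0.10, ..., 0.95 and record the F1 at each one
scores = {}
for step in range(5, 100, 5):
    threshold = step / 100
    pred = (proba >= threshold).astype(int)
    scores[threshold] = f1_score(y_val, pred, zero_division=0)

best_threshold = max(scores, key=scores.get)
print("F1 at 0.5:", round(scores[0.5], 3))
print("Best threshold:", best_threshold, "F1:", round(scores[best_threshold], 3))
```

Pick the threshold on a validation set, as here, and confirm the final score on a separate test set so the result is not overly optimistic.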
Real-World Mini Case Studies (Simple + Relatable)
Case 1: Medical Diagnosis (Recall matters most)
In healthcare, missing a positive case is the worst possible mistake.
If a model fails to detect a disease (false negative), the consequences can be deadly.
- High recall = fewer missed patients
- Precision is still important, but secondary
- F1 Score helps measure overall balance, but recall is the real priority
Why?
A false alarm can be retested.
A missed case might not get a second chance.
Case 2: Fraud Detection
Banks want to catch fraud, but they must avoid blocking real customers.
A false positive here means a genuine user gets flagged, causing frustration and financial loss.
- High precision = fewer innocent customers blocked
- Recall still matters, but cannot come at the cost of too many false alarms
- F1 Score gives a balanced view of both sides
Why?
A customer falsely flagged for fraud may lose trust immediately.
Case 3: Spam Classification
Most emails are legitimate (true negatives).
A good spam filter must correctly identify millions of “normal” emails every day.
But:
- F1 Score ignores true negatives
- So it cannot fully represent the performance of a spam classifier
- Metrics like Balanced Accuracy, MCC, or ROC-AUC offer a more complete evaluation
Why?
If F1 is high but TN is poor, legitimate emails keep landing in the spam folder, which is a very bad user experience.
Python Code Example (Beginner Friendly)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: about 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], n_classes=2, random_state=42)

# stratify=y keeps the same class ratio in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# By default these metrics score the positive class (label 1)
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
print("F1 Score:", f1_score(y_test, pred))
print(classification_report(y_test, pred))
What the output means:
- Precision tells how accurate positive predictions are
- Recall tells how many positives we caught
- F1 gives a balanced score
- classification_report shows precision, recall, and F1 for each class, plus accuracy and the macro and weighted averages
Common Beginner Mistakes
1. Relying Only on Accuracy
Accuracy looks good even when the model is failing, especially with imbalanced datasets.
If you trust accuracy alone, you will almost always overestimate your model’s performance.
2. Using F1 Without Checking the Confusion Matrix
F1 is just a summary.
Without the confusion matrix, you cannot see where the model is going wrong: FP, FN, TP, and TN all matter.
The confusion matrix always tells the real story.
3. Not Tuning the Threshold
Most models default to a 0.5 threshold, but this often does not give the best F1 Score.
If you never tune the threshold, you’re not seeing your model’s true potential.
4. Using Macro F1 for Imbalanced Data
Macro F1 treats all classes equally, even rare ones.
On heavily imbalanced datasets, a rare class with only a few samples can swing the score and make it misleading.
5. Comparing F1 Scores Across Different Datasets
An F1 Score of 0.70 on one dataset may be excellent, but terrible on another.
F1 is not an absolute number; it depends on the dataset and the problem.
6. Ignoring Business Cost
F1 assumes false positives and false negatives are equally bad.
In reality, one mistake may cost ₹1,000 while the other may cost ₹10,00,000.
Business context matters more than the metric.
How to Improve F1 Score (Actionable Tips)
Improving the F1 Score isn’t just about changing models; it’s about improving how the model learns and predicts. Small adjustments can make your precision and recall work together more effectively.
✔ Tune the Classification Threshold
Even a small shift in the decision threshold can dramatically change precision, recall, and ultimately the F1 Score.
On imbalanced data, F1 often improves once you tune the threshold away from the default 0.5. Tune it on a validation set, not on the test set.
✔ Resample the Dataset
Class imbalance is the enemy of a good F1 Score.
To fix this, you can use:
- SMOTE (Synthetic Minority Over-sampling Technique)
- Undersampling the majority class
- Class weights in your model
These methods help the model understand minority classes better.
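Of the three options, class weights are the easiest to try because they need no extra library. The sketch below compares a default model with a class-weighted one on synthetic data. The dataset settings and the choice of `LogisticRegression` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); settings are illustrative
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {}
for label, class_weight in [("default", None), ("balanced", "balanced")]:
    # class_weight="balanced" makes mistakes on the rare class cost more during training
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    pred = model.fit(X_train, y_train).predict(X_test)
    results[label] = {
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(label, results[label])
```

Reweighting typically raises recall at the expense of precision, so F1 can move in either direction. Always compare both runs rather than assuming an improvement.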
✔ Improve Feature Engineering
Better features create clearer separation between classes.
This reduces confusion and directly boosts both precision and recall, leading to a higher F1 Score.
✔ Tune Hyperparameters
Using techniques like grid search or random search can significantly improve model performance.
Well-tuned models make fewer mistakes and achieve a more balanced F1 Score.
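With scikit-learn, a grid search can optimize F1 directly by passing `scoring="f1"`. This is a minimal sketch; the tiny parameter grid and the synthetic dataset are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives); settings are illustrative
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# A deliberately tiny grid so the search finishes quickly
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="f1",  # pick the combination with the best cross-validated F1
    cv=3,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```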
✔ Collect More Data (Especially Minority Class)
If the minority class has too few examples, the model cannot learn its patterns properly.
Adding more samples, even 20% more, can lead to a noticeable improvement in F1.
What Is a “Good” F1 Score?
There is no single F1 Score that counts as “good” for every situation.
Different fields have different levels of acceptable risk, error tolerance, and data quality, so the meaning of a “good” F1 Score changes from one problem to another.
Medical Field
In healthcare, mistakes can be life-threatening.
Models must be extremely reliable, so an F1 Score above 0.90 is generally expected.
Fraud Detection
Fraud cases are rare, noisy, and difficult to predict perfectly.
Typical real-world systems achieve an F1 Score between 0.70 and 0.85.
NLP / Sentiment Analysis
On clean, well-labeled text datasets, models can often learn patterns fairly easily.
Good models often reach 0.80+ F1 Scores on such data.
Real-World Messy Data
When the dataset is full of noise, missing values, imbalance, or imperfect labels, even an F1 Score of 0.60 may be a meaningful win.
Final Summary
- Accuracy is not enough
- F1 Score balances precision & recall
- F1 is well suited to imbalanced datasets
- But it is less useful when TN matters
- Thresholds affect F1 heavily
- Macro/micro/weighted F1 help in multi-class models
- Other metrics are sometimes a better fit than F1
- Beginners often misuse F1
- You now understand F1 better than most beginners
If you're learning machine learning, the F1 Score will be one of your best friends, especially when accuracy tries to fool you.
