# Part 5: Striking the Balance — Understanding Underfitting and Overfitting in Linear Models

[**In Part 4**](https://abhis-space.hashnode.dev/part-4-linear-regression-key-techniques-for-better-model-performance?source=more_series_bottom_blogs), we focused on improving our model. But how do we know if it’s too weak or too aggressive?  
In this final post of the series, we’ll explain **underfitting**, **overfitting**, and the **bias-variance tradeoff** — one of the most important ideas in machine learning.

We’ll learn how to visualize it, fix it, and answer questions about it in interviews.

## Introduction

When building machine learning models, there are two classic traps that even seasoned data scientists can fall into: **underfitting** and **overfitting**. These two issues can silently ruin a model’s performance, yet they are some of the most intuitive concepts once you get the hang of them.

Here, we’ll break down underfitting and overfitting with:

* Simple definitions and metaphors
    
* Hands-on code and visualizations (using Python & NumPy)
    
* How to detect and fix both problems
    
* A final checklist to evaluate if our model is in the sweet spot
    

Whether we're just starting out or brushing up on fundamentals, this guide will give us a solid understanding.

## The Big Picture: What Are We Trying to Do?

When we train a machine learning model, our goal is to **learn patterns from data** that generalize well to new, unseen data.

Imagine we're tutoring a student. We want them to understand the concept (generalization), not just memorize answers to specific questions (overfitting) or misunderstand everything (underfitting).

## What is Underfitting?

**Definition:** A model is said to be underfitting when it is **too simple** to capture the underlying trend in the data.

#### Symptoms:

* High training error
    
* High test error
    
* Poor performance on both seen and unseen data
    

#### Analogy:

Imagine fitting a straight line through data that clearly forms a curve. Our model is too naive to catch what’s really happening.
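
To make the straight-line analogy concrete, here is a minimal sketch. It assumes a synthetic quadratic dataset (the same formula used later in this post, but with more points and a different random seed), and the helper names `mse` and `fit_and_score` are our own, not library functions.

```python
import numpy as np

# Synthetic curved data (assumed for illustration): quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3 * x**2 + 2 * x + 1 + rng.normal(0, 15, size=x.size)

# Hold out every fourth point as a simple test set
test_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def fit_and_score(degree):
    """Fit a polynomial on the training split; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return (mse(y_train, np.polyval(coeffs, x_train)),
            mse(y_test, np.polyval(coeffs, x_test)))

line_train, line_test = fit_and_score(1)  # straight line: too simple for a curve
quad_train, quad_test = fit_and_score(2)  # same shape as the underlying trend

print(f"degree 1: train MSE = {line_train:.1f}, test MSE = {line_test:.1f}")
print(f"degree 2: train MSE = {quad_train:.1f}, test MSE = {quad_test:.1f}")
```

If the straight line's error is clearly higher than the quadratic's on both splits, that is the underfitting signature: the model does poorly on data it has seen and on data it has not.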

#### Causes:

* Model is too simple (e.g., linear model for nonlinear data)
    
* Not enough training time (training is stopped too early)
    
* Poor features
    

## What is Overfitting?

**Definition:** A model overfits when it **memorizes the training data**, including noise and outliers, and fails to generalize to new data.

#### Symptoms:

* Very low training error
    
* Very high test error
    

#### Analogy:

Imagine a student who memorizes every answer from the practice test. When they see a new question in the exam, they panic.

#### Causes:

* Model is too complex (e.g., very deep tree, high-degree polynomial)
    
* Too many parameters for the size of the data
    
* Noisy training data
    
* Insufficient regularization
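
The "too many parameters for the size of the data" cause can be sketched in a few lines. This is an illustrative setup, not a benchmark: the dataset is synthetic, the choice of 12 training points and degree 11 is ours, and `true_fn` and `mse` are hypothetical helper names. A degree-11 polynomial has 12 coefficients, so it can pass through all 12 training points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    """Underlying trend (assumed for illustration)."""
    return 3 * x**2 + 2 * x + 1

# A deliberately tiny training set and a larger held-out test set
x_train = np.linspace(0, 10, 12)
y_train = true_fn(x_train) + rng.normal(0, 15, size=x_train.size)
x_test = rng.uniform(0, 10, size=200)
y_test = true_fn(x_test) + rng.normal(0, 15, size=x_test.size)

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (2, 11):
    # Polynomial.fit rescales x internally, which helps keep
    # high-degree fits numerically stable
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, model(x_train)), mse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.1f}")
```

The degree-11 model drives its training error to essentially zero because it has memorized the noise in those 12 points. Its test error cannot follow, since the noise in new data is different. The size of that gap is what we look for when we suspect overfitting.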
    

## Bias-Variance Tradeoff

**Understanding the Theory Behind the Balance**

While it's easy to grasp underfitting and overfitting visually, there's a deeper concept that unites them: the **bias-variance tradeoff**. This tradeoff helps explain *why* models behave the way they do as complexity changes.

**Definition of Bias (in Machine Learning):** **Bias** refers to the error introduced by approximating a complex problem with a simplified model. In simpler terms, it’s when a model **ignores key patterns** because it makes strong assumptions.

#### High Bias → Underfitting

* Happens when the model is **too simple** to capture patterns in the data.
    
* Tends to make **strong assumptions** about the data (e.g., assuming all relationships are linear).
    
* Leads to **consistently poor predictions**, both on training and test sets.
    

> Think of a student who didn’t study enough and tries to guess every answer based on a single rule — they’re wrong most of the time.

**Definition of Variance (in Machine Learning):** **Variance** measures how sensitive a model is to slight changes in the training data. It reflects how much the model's predictions would **change** if it were trained on a different sample from the same source.

#### High Variance → Overfitting

* Occurs when the model is **too complex** and tries to fit every detail of the training data, including noise.
    
* Sensitive to even slight changes in the data.
    
* Performs well on training data but poorly on unseen data.
    

> Like a student who memorizes every question on a practice test — they fail when the test format changes slightly.

#### The Ideal Zone: Balance

* A good model strikes a **balance between bias and variance**.
    
* It is **complex enough** to capture patterns, but **simple enough** to ignore noise.
    
* This sweet spot often lies somewhere in the **middle of the complexity spectrum**.
    

> 📌 Rule of Thumb: Increasing model complexity reduces bias but increases variance. The goal is to minimize **total error**, which comes from both.

$$\text{Total Error} = \underbrace{\text{Bias}^2}_{\text{error from wrong assumptions}} + \underbrace{\text{Variance}}_{\text{error from overreacting to noise}} + \text{Irreducible Error}$$
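
When we know the true function, which is only possible with synthetic data, the terms in this formula can be estimated by simulation. The sketch below is one way to do it, under stated assumptions: the quadratic `true_fn`, the noise level `NOISE_STD`, the number of trials, and the helper `bias_variance` are all our own choices for illustration. We repeatedly draw a fresh noisy training set, refit, and look at how the predictions are spread around the truth.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 15.0  # assumed noise level; irreducible error is NOISE_STD**2

def true_fn(x):
    """Underlying trend (assumed for illustration)."""
    return 3 * x**2 + 2 * x + 1

x_train = np.linspace(0, 10, 20)     # fixed training inputs
x_eval = np.linspace(0.5, 9.5, 50)   # points where we inspect predictions
n_trials = 1000                      # number of resampled training sets

def bias_variance(degree):
    """Estimate average squared bias and variance of a degree-`degree` fit."""
    preds = np.empty((n_trials, x_eval.size))
    for t in range(n_trials):
        # New noise each trial = a different sample from the same source
        y_train = true_fn(x_train) + rng.normal(0, NOISE_STD, size=x_train.size)
        preds[t] = Polynomial.fit(x_train, y_train, degree)(x_eval)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_eval)) ** 2)  # systematic miss
    variance = np.mean(preds.var(axis=0))                  # spread across trials
    return float(bias_sq), float(variance)

summary = {d: bias_variance(d) for d in (1, 2, 10)}
for d, (b2, var) in summary.items():
    print(f"degree {d:2d}: bias^2 = {b2:8.2f}, variance = {var:8.2f}, "
          f"irreducible = {NOISE_STD**2:.0f}")
```

In this setup the straight line carries most of its error as bias, while the degree-10 model carries most of it as variance. The irreducible term stays the same regardless of which model we pick.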

## Visualizing the Problem

Let’s use Python and NumPy to simulate and visualize:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic dataset
np.random.seed(1)
x = np.linspace(0, 10, 20)
y = 3 * x**2 + 2 * x + 1 + np.random.randn(20) * 15

# Fit & predict function
def fit_predict(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    x_line = np.linspace(min(x), max(x), 200)
    y_line = np.polyval(coeffs, x_line)
    return x_line, y_line

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, deg in enumerate([1, 2, 15]):
    x_line, y_line = fit_predict(x, y, deg)
    axes[i].scatter(x, y, color='blue', label='Data')
    axes[i].plot(x_line, y_line, color='red', label=f'Degree {deg}')
    axes[i].set_title(['Underfitting', 'Good Fit', 'Overfitting'][i])
    axes[i].legend()
    axes[i].grid(True)
plt.tight_layout()
plt.show()
```

This code shows:

* A linear model struggling to capture the pattern (underfit)
    
* A quadratic model doing well (good fit)
    
* A complex polynomial model that zigzags wildly (overfit)
    

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1753897561915/efe59a93-66d3-46b6-bb4b-40010a0683d4.png align="center")

### Training vs Validation Curve Plot

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic dataset
np.random.seed(1)
x = np.linspace(0, 10, 20)
y = 3 * x**2 + 2 * x + 1 + np.random.randn(20) * 15

# Reshape and split
x = x.reshape(-1, 1)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42)

train_errors = []
val_errors = []
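# Note: the 70/30 split leaves only 14 training points, so degrees >= 13 can
# pass through every training point, and np.polyfit may warn that the fit is
# poorly conditioned. That instability is part of what the plot shows.
degrees = range(1, 16)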

for d in degrees:
    coeffs = np.polyfit(x_train.flatten(), y_train, d)
    model = np.poly1d(coeffs)
    y_train_pred = model(x_train.flatten())
    y_val_pred = model(x_val.flatten())
    
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    val_errors.append(mean_squared_error(y_val, y_val_pred))

# Plotting
plt.figure(figsize=(10, 5))
plt.plot(degrees, train_errors, label='Training Error', marker='o')
plt.plot(degrees, val_errors, label='Validation Error', marker='o')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff: Error vs. Model Complexity')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()  # display the figure, as in the first example
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1753906442310/23374444-7a1e-4517-be8f-60784cec2ec4.png align="center")

The chart we've generated is a **Bias-Variance Tradeoff visualization**, showing how **model complexity** (polynomial degree) affects **training and validation error**.

> **X-axis**: `Model Complexity`, represented by the degree of the polynomial (from 1 to 15).
> 
> **Y-axis**: `Mean Squared Error (MSE)` — lower is better.
> 
> **Blue Line**: `Training Error` — how well the model fits the data it was trained on.
> 
> **Orange Line:** `Validation Error` — how well the model performs on unseen data.

**Interpretation of the Plot: Error vs Model Complexity**

This chart shows how model performance changes as we increase complexity by using **higher-degree polynomials** (from 1 to 15):

**Degrees 1–11 – Sweet Spot or Data Quirk?**

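* Both **training and validation errors look very low and nearly equal** on this plot.
    
* At first glance, this looks like we’ve **nailed the sweet spot**, as if every one of these models generalizes well.
    
* However, with such **seemingly flat error across degrees**, it's worth asking:
    
    > *“Is this real, or an artifact of the plot scale and the small dataset?”*
    
* This could happen if:
    
    * The **validation spike at high degrees stretches the y-axis**, so real differences between the low degrees are squashed toward zero. The noise in our data has a standard deviation of 15, so we should not expect any model to reach an MSE far below 15² = 225 on unseen data.
        
    * We have **too few data points** (only 20 samples, 14 of them used for training), which makes both curves noisy.
        
    * Together, these effects can hide the gap between the straight line and the quadratic, which means **true underfitting is hard to visualize** here. A log-scaled y-axis or a printed table of the errors would likely show it more clearly.
        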
        

**Degrees 12–15 – Clear Overfitting Zone**

* **Validation error spikes dramatically**, while **training error stays very low**.
    
* This is **classic overfitting**:
    
    * The model starts to memorize every tiny fluctuation in training data — even noise.
        
    * It loses the ability to generalize to unseen data.
        
* This is a clear sign of **high variance**.
    

**What This Tells Us (for Linear Regression Learners)**

* As we increase model complexity:
    
    * **Training error keeps falling, or at least never rises** (a more flexible polynomial can always fit the training points at least as well).
        
    * **Validation error decreases up to a point**, then **increases again** — forming the classic **U-shaped curve**.
        
* The goal is to stop at the **lowest point of validation error**: that is our sweet spot.
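
Finding that lowest point does not require reading it off a chart. As a rough sketch, assuming a synthetic dataset of 60 points and a single random train/validation split (both our own choices, as are the variable names), we can record the errors per degree and take the minimum with `np.argmin`:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Synthetic data (assumed): same quadratic trend as the post, more samples
x = np.linspace(0, 10, 60)
y = 3 * x**2 + 2 * x + 1 + rng.normal(0, 15, size=x.size)

# Random 70/30 split into training and validation indices
idx = rng.permutation(x.size)
train_idx, val_idx = idx[:42], idx[42:]

degrees = list(range(1, 11))
train_errors, val_errors = [], []
for d in degrees:
    model = Polynomial.fit(x[train_idx], y[train_idx], d)
    train_errors.append(float(np.mean((y[train_idx] - model(x[train_idx])) ** 2)))
    val_errors.append(float(np.mean((y[val_idx] - model(x[val_idx])) ** 2)))

# The sweet spot is the degree with the smallest validation error
best_degree = degrees[int(np.argmin(val_errors))]

for d, tr, va in zip(degrees, train_errors, val_errors):
    marker = "  <- lowest validation error" if d == best_degree else ""
    print(f"degree {d:2d}: train = {tr:8.1f}, val = {va:8.1f}{marker}")
```

A printed table like this also sidesteps the plot-scale problem: differences between low degrees stay visible even when one high degree has an enormous error. With a single small validation set, the chosen degree can shift from split to split, so it is best treated as a hint and not a final answer.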
    

### Conclusion

> Even with linear regression, when extended via **polynomial features**, it’s possible to **overfit**.  
> This plot helps us visually detect when our model is becoming **too complex** for the data it’s learning from.

## Detecting Underfitting & Overfitting

Use a **training vs. validation error curve** and compare what it shows against the patterns below:

| **Aspect** | **Underfitting** | **Overfitting** |
| --- | --- | --- |
| Training Error | High | Very Low |
| Test Error | High | High |
| Model Type | Too Simple | Too Complex |
| Generalization | Poor on both seen and unseen data | Poor on unseen data |
| Fixes | Increase complexity, add features | Regularization, simplify, more data |
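
The table can be turned into a rough rule of thumb in code. The function below is purely hypothetical: `diagnose_fit` and its `acceptable_error` threshold are names we made up for illustration, and what counts as "acceptable" depends entirely on the problem (for example, the known noise level or a business requirement).

```python
def diagnose_fit(train_error, val_error, acceptable_error):
    """Classify a model using the patterns from the table above.

    acceptable_error is a problem-specific threshold chosen by us;
    it is an assumption of this sketch, not a universal constant.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on
        return "underfitting"
    if val_error > acceptable_error:
        # Fits the training data but fails on unseen data
        return "overfitting"
    return "good fit"

# Example readings (made-up numbers, for illustration only)
print(diagnose_fit(train_error=700, val_error=720, acceptable_error=300))
print(diagnose_fit(train_error=5, val_error=4000, acceptable_error=300))
print(diagnose_fit(train_error=210, val_error=250, acceptable_error=300))
```

A single threshold is a simplification. In practice we would also look at how the two errors move as complexity changes, which is exactly what the error curve gives us.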

## Remedies and Fixes

#### To Fix Underfitting:

* Use a more complex model
    
* Add more features or transformations
    
* Reduce regularization (We will come to this later)
    
* Train longer
    

#### To Fix Overfitting:

* Simplify the model (fewer parameters)
    
* Use regularization (L1, L2)
    
* Get more data
    
* Use dropout for neural networks (we will come to this later)
    
* Use cross-validation to choose the model complexity and catch overfitting early
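
Cross-validation is the easiest of these to try right away. The sketch below implements 5-fold cross-validation by hand with NumPy so that each step is visible. The dataset is synthetic, and the fold count, degree range, and the helper name `cv_mse` are assumptions for illustration. Libraries such as scikit-learn provide ready-made versions of the same idea.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data (assumed): quadratic trend plus noise
x = np.linspace(0, 10, 100)
y = 3 * x**2 + 2 * x + 1 + rng.normal(0, 15, size=x.size)

# Shuffle the indices once and cut them into 5 folds
folds = np.array_split(rng.permutation(x.size), 5)

def cv_mse(degree):
    """Average validation MSE over the folds for one polynomial degree."""
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], degree)
        fold_errors.append(np.mean((y[val_idx] - model(x[val_idx])) ** 2))
    return float(np.mean(fold_errors))

cv_errors = {d: cv_mse(d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)

for d, err in cv_errors.items():
    print(f"degree {d}: cross-validated MSE = {err:.1f}")
print("degree with lowest cross-validated MSE:", best_degree)
```

Because every point is used for validation exactly once, the estimate is less sensitive to a lucky or unlucky split than a single hold-out set. Cross-validation does not change the model by itself; it gives us a more trustworthy signal for deciding how complex the model should be.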
    

## Bonus: A Real-World Example

Let’s say we’re predicting exam scores based on hours studied. Our dataset:

| **Hours Studied (x)** | **Actual Score (y)** |
| --- | --- |
| 0 | 42 |
| 1 | 47 |
| 2 | 53 |
| 3 | 58 |
| 4 | 67 |

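If our predicted values were 40, 45, 50, 55, 60, the **residuals** (actual minus predicted) would be 2, 2, 3, 3, 7: all positive and growing with hours studied, a systematic pattern that points to underfitting.  
If they were 42, 47, 53, 58, 67, every prediction would match the training data perfectly, which may signal overfitting unless the model does equally well on new data.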
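
The residual arithmetic for the first set of predictions can be checked with a few lines of NumPy. The arrays come straight from the table above; only the variable names are ours.

```python
import numpy as np

hours = np.array([0, 1, 2, 3, 4])
actual = np.array([42, 47, 53, 58, 67])
predicted = np.array([40, 45, 50, 55, 60])  # first set of predictions

# Residual = actual - predicted, one value per student
residuals = actual - predicted
mse = float(np.mean(residuals ** 2))

print(residuals.tolist())                              # [2, 2, 3, 3, 7]
print(f"mean residual = {residuals.mean():.1f}")       # mean residual = 3.4
print(f"MSE = {mse:.1f}")                              # MSE = 15.0
print("all residuals positive:", bool((residuals > 0).all()))
```

Residuals that share the same sign and grow with `hours` are not random scatter. They suggest the model is missing part of the trend, such as a steeper slope or some curvature.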

## Quick Flashcards

**Q:** What is underfitting?  
**A:** When the model is too simple to learn the data's structure — high training and test error.

**Q:** What is overfitting?  
**A:** When the model memorizes the training data, including noise — low train error, high test error.

**Q:** What causes overfitting?  
**A:** Too complex model, too many parameters, noisy data, not enough regularization.

**Q:** What is the bias-variance tradeoff?  
**A:** It's the balance between underfitting (high bias) and overfitting (high variance) to minimize total error.

**Q:** How can you fix underfitting?  
**A:** Use a more complex model, train longer, improve features, reduce regularization.

**Q:** How can you fix overfitting?  
**A:** Use regularization, collect more data, simplify the model, or use dropout (in neural networks).

## Conclusion

> Understanding underfitting and overfitting is a foundational skill in machine learning. We don’t need to be math geniuses to recognize them. We just need to:
> 
> * Visualize often
>     
> * Track performance on both training and test sets
>     
> * Tweak our models thoughtfully
>     
> 
> Once we develop the intuition, spotting these patterns becomes second nature.

## What’s next?

We’ve now completed the core 5-part series on linear regression and supervised learning! What’s next? **Regularization** — our tool to tame overfitting without losing performance. Stay tuned for the next post, where we’ll explore **Ridge and Lasso regression**, and how to choose the right complexity automatically.  
  
*Make your models robust and reliable.*