Chapter 2: Stochastic Gradient Descent

By Tomas Beuzen 🚀

Chapter Learning Objectives

  • Explain and implement the stochastic gradient descent algorithm.

  • Explain the advantages and disadvantages of stochastic gradient descent as compared to gradient descent.

  • Explain what epochs, batch sizes, iterations, and computations are in the context of gradient descent and stochastic gradient descent.


import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from utils.plotting import *

1. Motivation for Stochastic Gradient Descent

Last chapter we looked at “vanilla” gradient descent. Almost all loss functions you’ll use in ML involve a sum over all the (training) data, e.g., mean squared error:

\[f(w)=\frac{1}{n}\sum_{i=1}^{n}\left(h_w(x_i)-y_i\right)^2\]
Here \(f(w)\) is the value of the loss function, \(h_w(x)\) is the model we wish to fit, e.g., for linear regression: \(h_w(x)=w^Tx\). The goal is to find the weights \(\boldsymbol{w}\) that minimize the loss function. With gradient descent we:

  1. Start with some arbitrary \(\boldsymbol{w}\)

  2. Calculate the gradient using all training examples

  3. Use the gradient to adjust \(\boldsymbol{w}\)

  4. Repeat for \(I\) iterations or until the step-size is sufficiently small

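The four steps above can be sketched for linear regression with a squared-error loss. This is an illustrative implementation, not the book's own code; the synthetic data, learning rate, and iteration count are all arbitrary choices for demonstration:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=100):
    """Vanilla gradient descent for linear regression (squared-error loss)."""
    w = np.zeros(X.shape[1])                # 1. start with some arbitrary w
    for _ in range(iterations):
        grad = X.T @ (X @ w - y) / len(y)   # 2. gradient uses ALL n examples
        w = w - alpha * grad                # 3. adjust w using the gradient
    return w                                # 4. fixed iteration budget here

# Tiny synthetic example where the true relationship is y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(X, y))  # converges close to [2.]
```

Note that step 2 is the expensive part: every single iteration loops over all \(n\) training examples.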
But if we have a large \(n\) and/or a large number of parameters in \(\boldsymbol{w}\), vanilla gradient descent becomes very computationally expensive (when we get to deep learning, we’ll have models where the number of weights to optimize is in the millions!). Say \(n = 1{,}000{,}000\), we have 1000 parameters to optimize, and we do 1000 iterations: that’s \(10^{12}\) computations!

We can reduce this workload by using just a fraction of our dataset to update our parameters each iteration (rather than using the whole data set). This is called “stochastic gradient descent”.
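The back-of-the-envelope numbers above are easy to verify. As a rough sketch (treating one parameter update on one example as one "computation"), using a single data point per iteration instead of all \(n\) cuts the per-iteration cost by a factor of \(n\):

```python
n, d, iterations = 1_000_000, 1_000, 1_000

# Vanilla gradient descent: every iteration touches all n examples.
gd_computations = n * d * iterations   # 10^12

# Stochastic gradient descent: each iteration uses a single example.
sgd_computations = 1 * d * iterations  # 10^6

print(f"{gd_computations:.0e} vs {sgd_computations:.0e}")
```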

2. Stochastic Gradient Descent

Gradient Descent:

\[w_{t+1}=w_{t}-\alpha_t\frac{\partial f(w_t)}{\partial w}\]

Stochastic Gradient Descent:

\[w_{t+1}=w_{t}-\alpha_t\frac{\partial f_i(w_t)}{\partial w}\]

Pretty simple! What exactly is the difference between the two equations? One of them includes a subscript \(i\). That means, instead of updating our parameters based on a gradient calculated using all training data, we simply use one of our data points (the \(i\)-th one). This is best seen by example; let’s use the Pokemon dataset from last chapter:

df = pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack'], index_col=0)
x = StandardScaler().fit_transform(df[['defense']]).flatten()
y = df['attack'].to_numpy()
plot_pokemon(x, y, x_range=[0, 1], y_range=[0, 200], dx=0.2, dy=50)
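Before applying it to real data, the SGD update rule itself can be sketched in code. This is a minimal illustration on synthetic data (the function name, hyperparameters, and random seed are all hypothetical choices, not the book's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_gradient_descent(X, y, alpha=0.01, iterations=2000):
    """SGD for linear regression: each update uses ONE random example."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        i = rng.integers(len(y))          # pick a single data point i
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of f_i only
        w = w - alpha * grad
    return w

# Same tiny synthetic example, y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(stochastic_gradient_descent(X, y))  # converges close to [2.]
```

Each iteration here costs a gradient on one example instead of all \(n\), which is exactly where the computational savings come from.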