# Chapter 2: Stochastic Gradient Descent

By Tomas Beuzen 🚀

## Chapter Learning Objectives

• Explain and implement the stochastic gradient descent algorithm.

• Explain the advantages and disadvantages of stochastic gradient descent as compared to gradient descent.

• Explain what epochs, batch sizes, iterations, and computations are in the context of gradient descent and stochastic gradient descent.

## Imports

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from utils.plotting import *


## 1. Motivation for Stochastic Gradient Descent

Last chapter we looked at “vanilla” gradient descent. Almost all loss functions you’ll use in ML involve a sum over all the (training) data, e.g., mean squared error:

$f(w)=\frac{1}{n}\sum^{n}_{i=1}(h_w(x_i)-y_i)^2$

Here, $$f(w)$$ is the value of the loss function and $$h_w(x)$$ is the model we wish to fit; for linear regression, for example, $$h_w(x)=w^Tx$$. The goal is to find the weights $$\boldsymbol{w}$$ that minimize the loss function. With gradient descent we:

1. Start with some arbitrary $$\boldsymbol{w}$$

2. Calculate the gradient using all training examples

3. Use the gradient to adjust $$\boldsymbol{w}$$

4. Repeat for $$I$$ iterations or until the step-size is sufficiently small (see the code sketch below)
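
To make these steps concrete, here is a minimal sketch of vanilla gradient descent for a linear model with the mean squared error loss above. The function names, fixed learning rate, and tolerance are illustrative choices, not code from this chapter:

def gradient(w, X, y):
    """Gradient of the MSE loss over all n training examples, for h_w(x) = w^T x."""
    n = len(y)
    return 2 / n * X.T @ (X @ w - y)


def gradient_descent(w, X, y, alpha=0.01, iterations=1000, tol=1e-8):
    """Repeatedly step w in the direction of the negative gradient."""
    for _ in range(iterations):
        step = alpha * gradient(w, X, y)
        w = w - step
        if np.linalg.norm(step) < tol:  # stop early if the step size is sufficiently small
            break
    return w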

But if $$n$$ is large and/or $$\boldsymbol{w}$$ has many elements, vanilla gradient descent becomes very computationally expensive (when we get to deep learning, we’ll have models where the number of weights to optimize is in the millions!). Say $$n$$ = 1,000,000, we have 1000 parameters to optimize, and we do 1000 iterations - that’s $$10^{12}$$ computations!
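
As a rough sanity check on that arithmetic (counting one "computation" per partial-derivative term, which is a simplification):

n = 1_000_000       # training examples
d = 1_000           # parameters to optimize
iterations = 1_000  # gradient descent iterations
n * d * iterations  # 1,000,000,000,000 = 10**12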

We can reduce this workload by using just a fraction of our dataset to update our parameters each iteration (rather than using the whole dataset). This is called “stochastic gradient descent”.
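
For example, one way to estimate the gradient from only a fraction of the data is to draw a small random batch of examples each iteration. This is just a sketch; the function name and the default `batch_size` are illustrative, not from this chapter:

def minibatch_gradient(w, X, y, batch_size=32):
    """MSE gradient estimated from a random subset (a "mini-batch") of the data."""
    batch = np.random.choice(len(y), size=batch_size, replace=False)
    return 2 / batch_size * X[batch].T @ (X[batch] @ w - y[batch])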

## 2. Stochastic Gradient Descent

Here is the gradient descent update rule alongside the stochastic gradient descent update rule:

$w_{j+1}=w_{j}-\alpha_j\frac{\partial}{\partial w_j}f(w_j)$

$w_{j+1}=w_{j}-\alpha_j\frac{\partial}{\partial w_j}f_i(w_j)$

Pretty simple! What exactly is the difference between the two equations above? Well, one of them includes a subscript $$i$$. That means that, instead of updating our parameters based on a gradient calculated from all the training data, we use just one of our data points (the $$i$$th one).
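
As a sketch (reusing the linear model and squared-error loss from earlier; the function names are illustrative, not code from this chapter), an update based on a single random example might look like this:

def stochastic_gradient(w, X, y, i):
    """Gradient of the loss for the single example i: f_i(w) = (w^T x_i - y_i)^2."""
    return 2 * (X[i] @ w - y[i]) * X[i]


def stochastic_gradient_descent(w, X, y, alpha=0.01, iterations=1000):
    """Each iteration updates w using the gradient of one randomly chosen example."""
    for _ in range(iterations):
        i = np.random.randint(len(y))  # pick a single training example at random
        w = w - alpha * stochastic_gradient(w, X, y, i)
    return w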

Let's see this in action with the Pokemon dataset from last chapter:

df = pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack'], index_col=0)