Chapter 1: Optimization & Gradient Descent

By Tomas Beuzen 🚀

Chapter Learning Objectives


  • Explain the difference between a model, loss function, and optimization algorithm in the context of machine learning.

  • Explain how the gradient descent algorithm works.

  • Apply gradient descent to linear and logistic regression.

  • Use scipy.optimize.minimize() to minimize a function.

Imports


import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from utils.plotting import *

1. Optimization and Machine Learning


In data science and computer science, we optimize a lot of stuff. For example, in linear regression we optimize for the intercept and coefficients of our model, in clustering algorithms like k-means we optimize our clusters, in neural networks we optimize the weights in our network (more on that in a later chapter!), etc.

In one sentence, “optimization” simply refers to minimizing/maximizing a function. For example, what value of \(x\) minimizes the function \(f(x) = (x-2)^2 + 5\)? What is the minimum value? Answers: \(x=2\), and \(f(x)=5\).
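We can verify this answer numerically with scipy.optimize.minimize() (which we import below and will use later in the chapter). This is just a quick sketch: we hand the function and a starting guess to the optimizer and let it search for the minimum:

```python
from scipy.optimize import minimize

# f(x) = (x - 2)^2 + 5 has its minimum at x = 2, where f(2) = 5
f = lambda x: (x - 2) ** 2 + 5

result = minimize(f, x0=0.0)  # start the search at x = 0
print(result.x)    # ≈ [2.]
print(result.fun)  # ≈ 5.0
```

Note that `minimize()` returns the location of the minimum (`result.x`) and the function value at that location (`result.fun`), matching the answer we worked out by hand.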

If you’re reading this, you’re likely already familiar with machine learning. You can start to think of machine learning as a three-step process:

  1. Choose your model: controls the space of possible functions that map \(X\) to \(y\) (e.g., a linear model can only learn linear functions)

  2. Choose your loss function: tells us how to compare these various functions (e.g., is \(y=5 + 2x_1+3x_2\) a better model than \(y=1 + 10x_1-x_2\)?)

  3. Choose your optimization algorithm: finds the minimum of the loss function (e.g., what is the optimum value of \(w_0\) and \(w_1\) in \(y=w_0 + w_1x\)?)

In this chapter we’ll take a detailed look at optimization, focusing on a particular optimization algorithm known as gradient descent.

2. Loss Functions


Loss functions (also often called “objective functions” or “cost functions”, although some debate that these are slightly different things) map the performance of a model to a real number, and they are the thing we want to optimize! For example, here’s the mean squared error (MSE), a common loss function:

\[f(y,\hat{y})=\frac{1}{n}\sum^{n}_{i=1}(\hat{y_i}-y_i)^2\]

Where \(y_i\) is the actual response and \(\hat{y_i}\) is the predicted response.
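To make this concrete, here’s the MSE computed by hand on a tiny made-up example (the three actual/predicted pairs are invented purely for illustration):

```python
import numpy as np

# Toy example: three actual responses and their predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# MSE: the average squared difference between predictions and actuals
mse_value = np.mean((y_pred - y_true) ** 2)
print(mse_value)  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```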

Consider a simple linear regression model \(\hat{y_i} = w_0 + w_1x_i\), then our loss function is:

\[f(y,x,w)=\frac{1}{n}\sum^{n}_{i=1}((w_0 + w_1x_i)-y_i)^2\]

The optimization problem here is to find the values of \(w_0\) and \(w_1\) that minimize the MSE.
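Written as code, this loss becomes a function of the weights \(w_0\) and \(w_1\). Here’s a minimal sketch (the toy `x_toy`/`y_toy` arrays are made up for illustration; they were generated from \(y = 1 + 2x\), so the weights \([1, 2]\) give zero loss):

```python
import numpy as np

def mse(w, x, y):
    """MSE of the linear model y_hat = w[0] + w[1] * x."""
    return np.mean(((w[0] + w[1] * x) - y) ** 2)

# Toy data generated by y = 1 + 2x, so w = [1, 2] fits exactly
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.0, 5.0, 7.0])

print(mse([1.0, 2.0], x_toy, y_toy))  # 0.0 (perfect fit)
print(mse([0.0, 0.0], x_toy, y_toy))  # mean of [9, 25, 49] ≈ 27.67
```

Any choice of weights gives a loss value, and the optimizer’s job is to find the weights with the smallest one.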

3. Optimizing Linear Regression


I’m going to build up the intuition for optimization in a practical and visual way with the help of our old friend linear regression. If you’d prefer a more mathematical approach, I encourage you to check out my colleague Mike Gelbart’s lecture on YouTube, Thinking about Optimization, or Chapter 7 of Mathematics for Machine Learning by Deisenroth et al.

We’ll use a dataset of Pokemon “attack” and “defense” stats to do this. I’m going to start with just 10 observations:

df = (pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack'], index_col=0)
        .head(10)
        .sort_values(by='defense')
        .reset_index()
     )
x = df['defense']
y = df['attack']
df
         name  attack  defense
0    Caterpie      30       35
1  Charmander      52       43
2   Bulbasaur      49       49
3  Charmeleon      64       58
4     Ivysaur      62       63
5    Squirtle      48       65
6   Charizard     104       78
7   Wartortle      63       80
8   Blastoise     103      120
9    Venusaur     100      123

Throughout this chapter, I’m leveraging plotting scripts I imported from utils.plotting. I abstracted the code out of the notebook to avoid cluttering the material here and because how I made these plots is not important - but feel free to check out the source code if you wish!

plot_pokemon(x, y)