# Chapter 1: Optimization & Gradient Descent

By Tomas Beuzen 🚀

## Chapter Learning Objectives

- Explain the difference between a model, loss function, and optimization algorithm in the context of machine learning.

- Explain how the gradient descent algorithm works.

- Apply gradient descent to linear and logistic regression.

- Use scipy.optimize.minimize() to minimize a function.

## Imports

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from utils.plotting import *


## 1. Optimization and Machine Learning

In data science and computer science, we optimize a lot of stuff. For example, in linear regression we optimize the intercept and coefficients of our model, in clustering algorithms like k-means we optimize the locations of our cluster centers, in neural networks we optimize the weights in our network (more on that in a later chapter!), etc.

In one sentence, “optimization” simply refers to minimizing/maximizing a function. For example, what value of $x$ minimizes the function $f(x) = (x-2)^2 + 5$? What is the minimum value? Answers: $x=2$, and $f(x)=5$.
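We can verify these answers numerically with scipy.optimize.minimize() (which we'll look at properly at the end of the chapter). Here's a minimal sketch; the starting guess of $x=0$ is an arbitrary choice:

from scipy.optimize import minimize

def f(x):
    # f(x) = (x - 2)^2 + 5; minimize() passes x in as a 1-element array
    return (x[0] - 2) ** 2 + 5

result = minimize(f, x0=0)  # start the search from an arbitrary initial guess
result.x, result.fun        # -> (array([2.]), 5.0), up to numerical precision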

If you’re reading this, you’re likely already familiar with machine learning. You can start to think of machine learning as a three-step process:

1. Choose your model: controls the space of possible functions that map $X$ to $y$ (e.g., a linear model can only learn linear functions)

2. Choose your loss function: tells us how to compare these various functions (e.g., is $y=5 + 2x_1+3x_2$ a better model than $y=1 + 10x_1-x_2$? See the sketch just after this list)

3. Choose your optimization algorithm: finds the minimum of the loss function (e.g., what are the optimum values of $w_0$ and $w_1$ in $y=w_0 + w_1x$?)
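To make step 2 concrete, here's a small sketch (on made-up data of my own) that uses the mean squared error, a loss we'll define in the next section, to compare the two candidate models from the example above:

import numpy as np
from sklearn.metrics import mean_squared_error

X = np.array([[1, 2], [2, 1], [3, 3], [4, 2]])  # made-up features x1, x2
y = np.array([14, 12, 22, 21])                  # made-up responses

y_hat_a = 5 + 2 * X[:, 0] + 3 * X[:, 1]  # predictions from y = 5 + 2*x1 + 3*x2
y_hat_b = 1 + 10 * X[:, 0] - X[:, 1]     # predictions from y = 1 + 10*x1 - x2

mean_squared_error(y, y_hat_a), mean_squared_error(y, y_hat_b)  # -> (2.25, 112.25)

On this particular data the first model has the lower loss, so under MSE it's the better of the two.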

In this chapter we’ll be taking a look at optimization in detail and a particular optimization algorithm known as gradient descent.

## 2. Loss Functions

Loss functions (also often called “objective functions” or “cost functions”, although some argue that these are slightly different things) map the performance of a model to a real number, and they are the thing we want to optimize! For example, here’s the mean squared error (MSE), a common loss function:

$$f(y,\hat{y})=\frac{1}{n}\sum^{n}_{i=1}(\hat{y_i}-y_i)^2$$

where $y_i$ is the actual response and $\hat{y_i}$ is the predicted response.
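This formula is simple enough to code up directly. Here's a sketch of a from-scratch implementation, checked against scikit-learn's mean_squared_error on a few made-up numbers:

import numpy as np
from sklearn.metrics import mean_squared_error

def mse(y, y_hat):
    """Mean squared error: the average squared difference between predicted and actual responses."""
    return np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])
mse(y_true, y_pred), mean_squared_error(y_true, y_pred)  # both -> 0.25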

Consider a simple linear regression model $\hat{y_i} = w_0 + w_1x_i$; then our loss function is:

$$f(y,x,w)=\frac{1}{n}\sum^{n}_{i=1}((w_0 + w_1x_i)-y_i)^2$$

The optimization problem here is to find the values of $w_0$ and $w_1$ that minimize the MSE.
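Written as code, this loss becomes a function of the weights. The sketch below (the name and interface are my own choices) takes a weight vector w = [w0, w1] plus data and returns the MSE, which is exactly the kind of function an optimization algorithm can minimize:

import numpy as np

def mse_loss(w, x, y):
    """MSE of the simple linear model y_hat = w[0] + w[1] * x on data (x, y)."""
    y_hat = w[0] + w[1] * np.asarray(x)
    return np.mean((y_hat - np.asarray(y)) ** 2)

mse_loss([0, 2], x=[1, 2, 3], y=[2, 4, 6])  # -> 0.0, this line fits the data perfectly

Different values of $w_0$ and $w_1$ give different losses, and gradient descent, introduced below, is one way to search for the minimizing pair.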

## 3. Optimizing Linear Regression

I’m going to build up the intuition for optimization in a practical and visual way with the help of our old friend linear regression. If you’d prefer a more mathematical approach, I encourage you to check out my colleague Mike Gelbart’s YouTube lecture Thinking about Optimization, or Chapter 7 of *Mathematics for Machine Learning* by Deisenroth et al.

We’ll use a dataset of Pokemon “attack” and “defense” stats to do this. I’m going to start with just 10 observations:

# Load each Pokemon's name, attack, and defense stats, sorted by defense
df = (pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack'], index_col=0)
        .sort_values(by='defense')
        .reset_index()
     )
x = df['defense']
y = df['attack']
df

|   | name       | attack | defense |
|---|------------|--------|---------|
| 0 | Caterpie   | 30     | 35      |
| 1 | Charmander | 52     | 43      |
| 2 | Bulbasaur  | 49     | 49      |
| 3 | Charmeleon | 64     | 58      |
| 4 | Ivysaur    | 62     | 63      |
| 5 | Squirtle   | 48     | 65      |
| 6 | Charizard  | 104    | 78      |
| 7 | Wartortle  | 63     | 80      |
| 8 | Blastoise  | 103    | 120     |
| 9 | Venusaur   | 100    | 123     |

Throughout this chapter, I’m leveraging plotting scripts I imported from utils.plotting. I abstracted the code out of the notebook to avoid cluttering the material here and because how I made these plots is not important - but feel free to check out the source code if you wish!

plot_pokemon(x, y)