# Chapter 1: Optimization & Gradient Descent

**By Tomas Beuzen 🚀**

## Chapter Learning Objectives

- Explain the difference between a model, a loss function, and an optimization algorithm in the context of machine learning.
- Explain how the gradient descent algorithm works.
- Apply gradient descent to linear and logistic regression.
- Use `scipy.optimize.minimize()` to minimize a function.

## Imports

```
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from utils.plotting import *
```

## 1. Optimization and Machine Learning

In data science and computer science, we optimize a lot of stuff. For example, in linear regression we optimize for the intercept and coefficients of our model, in clustering algorithms like k-means we optimize our clusters, in neural networks we optimize the weights in our network (more on that in a later chapter!), etc.

In one sentence, "optimization" simply refers to minimizing/maximizing a function. For example, what value of \(x\) minimizes the function \(f(x) = (x-2)^2 + 5\)? What is the minimum value? Answers: \(x=2\), and \(f(x)=5\).
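As a quick preview of `scipy.optimize.minimize()` (one of this chapter's learning objectives), here's a minimal sketch that solves this exact toy problem numerically. The starting point `x0=[0.0]` is an arbitrary choice of mine:

```
from scipy.optimize import minimize

# minimize() passes the parameters to f as a 1-d array, hence x[0]
def f(x):
    return (x[0] - 2) ** 2 + 5

result = minimize(f, x0=[0.0])  # start the search at x = 0
print(result.x[0], result.fun)  # roughly 2.0 and 5.0, matching the answers above
```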

If you're reading this, you're likely already familiar with machine learning. You can start to think of machine learning as a three-step process (sketched in code after the list below):

1. **Choose your model**: controls the space of possible functions that map \(X\) to \(y\) (e.g., a linear model can only learn linear functions)
2. **Choose your loss function**: tells us how to compare these various functions (e.g., is \(y=5 + 2x_1+3x_2\) a better model than \(y=1 + 10x_1-x_2\)?)
3. **Choose your optimization algorithm**: finds the minimum of the loss function (e.g., what is the optimum value of \(w_0\) and \(w_1\) in \(y=w_0 + w_1x\)?)
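To see how these three choices come together in practice, consider the sketch below (the toy arrays are made up for illustration): scikit-learn's `LinearRegression` fixes the model family to linear functions, its `fit()` method minimizes squared error, and the optimization is handled internally by a least-squares solver.

```
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1.0], [2.0], [3.0]])  # made-up feature (one column)
y_toy = np.array([5.0, 7.0, 9.0])        # made-up target, exactly y = 3 + 2x

model = LinearRegression()            # step 1: model = linear functions y = w0 + w1*x
model.fit(X_toy, y_toy)               # steps 2 & 3: squared-error loss and a
                                      # least-squares solver, run inside .fit()
print(model.intercept_, model.coef_)  # ~3.0 and [~2.0]
```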

In this chapter we'll be taking a look at optimization in detail and a particular optimization algorithm known as gradient descent.

## 2. Loss Functions

Loss functions (also often called "objective functions" or "cost functions", although some debate that these are slightly different things) are what we use to map the performance of a model to a real number, and it's the thing we want to optimize! For example, here's the mean squared error (MSE), which is a common loss function:

$$\text{MSE} = \frac{1}{n}\sum^{n}_{i=1}(y_i - \hat{y_i})^2$$

where \(y_i\) is the actual response and \(\hat{y_i}\) is the predicted response.
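To sanity-check the formula, here's a quick hand computation on made-up numbers (the arrays below are illustrative only), compared against sklearn's `mean_squared_error` from the imports above:

```
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([30, 52, 49])  # made-up "actual" responses
y_pred = np.array([35, 50, 44])  # made-up predictions

print(np.mean((y_true - y_pred) ** 2))     # 18.0, straight from the formula
print(mean_squared_error(y_true, y_pred))  # 18.0, same value via sklearn
```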

Consider a simple linear regression model \(\hat{y_i} = w_0 + w_1x_i\); then our loss function is:

$$\text{MSE} = \frac{1}{n}\sum^{n}_{i=1}\left(y_i - (w_0 + w_1x_i)\right)^2$$

The optimization problem here is to find the values of \(w_0\) and \(w_1\) that minimize the MSE.
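One way to make this concrete is to write the loss as a function of the parameters \(w_0\) and \(w_1\) rather than of the predictions. The sketch below is illustrative only; the `mse` helper and toy arrays are my own, not the chapter's Pokemon data:

```
import numpy as np

def mse(w, x, y):
    """MSE of the simple linear model y_hat = w[0] + w[1] * x."""
    y_hat = w[0] + w[1] * x
    return np.mean((y - y_hat) ** 2)

# illustrative toy data, roughly following y = 1 + 2x
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.1, 4.9, 7.2])

print(mse([1.0, 2.0], x_toy, y_toy))  # good parameters -> small loss (0.02)
print(mse([0.0, 0.0], x_toy, y_toy))  # bad parameters  -> large loss (~28.5)
```

Different choices of `w` give different loss values; the job of an optimization algorithm is to search this parameter space for the values with the smallest loss.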

## 3. Optimizing Linear Regression

I'm going to build up the intuition for optimization in a practical and visual way with the help of our old friend, linear regression. If you'd prefer a more mathematical approach, I encourage you to check out my colleague Mike Gelbart's YouTube lecture Thinking about Optimization, or Chapter 7 of Mathematics for Machine Learning by Deisenroth et al.

We'll use a dataset of Pokemon "attack" and "defense" stats to do this. I'm going to start with just 10 observations:

```
df = (pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack'], index_col=0)
        .head(10)
        .sort_values(by='defense')
        .reset_index()
      )
x = df['defense']
y = df['attack']
df
```

|   | name | attack | defense |
|---|---|---|---|
| 0 | Caterpie | 30 | 35 |
| 1 | Charmander | 52 | 43 |
| 2 | Bulbasaur | 49 | 49 |
| 3 | Charmeleon | 64 | 58 |
| 4 | Ivysaur | 62 | 63 |
| 5 | Squirtle | 48 | 65 |
| 6 | Charizard | 104 | 78 |
| 7 | Wartortle | 63 | 80 |
| 8 | Blastoise | 103 | 120 |
| 9 | Venusaur | 100 | 123 |

Throughout this chapter, I'm leveraging plotting scripts I imported from `utils.plotting`. I abstracted the code out of the notebook to avoid cluttering the material here, and because how I made these plots is not important, but feel free to check out the source code if you wish!

```
plot_pokemon(x, y)
```