Appendix B: Logistic Loss

Imports


import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import plotly.express as px

1. Logistic Regression Refresher


Logistic regression is a classification model in which we calculate the probability that an observation belongs to the positive class as:

\[z=w^Tx\]
\[\hat{y} = \frac{1}{(1+\exp(-z))}\]

And then assign that observation to a class based on some threshold (usually 0.5):

\[\begin{split}\text{Class }\hat{y}=\left\{ \begin{array}{ll} 0, & \hat{y}\le0.5 \\ 1, & \hat{y}>0.5 \\ \end{array} \right.\end{split}\]
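As a quick illustration, here is a minimal sketch of those two steps in NumPy; the weights and feature values below are made up purely for demonstration:

import numpy as np

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights (first entry is the intercept)
x = np.array([1.0, 0.2, 0.7])    # hypothetical observation (leading 1 pairs with the intercept)

z = w @ x                        # the linear combination z = w^T x
y_hat = 1 / (1 + np.exp(-z))     # sigmoid squashes z into a probability in (0, 1)
label = int(y_hat > 0.5)         # assign the class using the 0.5 threshold
print(y_hat, label)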

2. Motivating the Loss Function


  • Below is the mean squared error as a loss function for optimizing linear regression:

\[f(w)=\frac{1}{n}\sum^{n}_{i=1}(\hat{y}_i-y_i)^2\]
  • That won’t work for logistic regression classification problems because the resulting loss surface ends up being “non-convex” (roughly speaking, it has multiple local minima, so gradient descent can get stuck)

  • Instead we use the following loss function:

\[f(w)=-\frac{1}{n}\sum_{i=1}^n\left[y_i\log\left(\frac{1}{1 + \exp(-w^Tx_i)}\right) + (1 - y_i)\log\left(1 - \frac{1}{1 + \exp(-w^Tx_i)}\right)\right]\]
  • This function is called the “log loss” or “binary cross entropy”

  • I want to visually show you the difference between these two functions, and then we’ll discuss why that loss function works

  • Recall the Pokemon dataset from Chapter 1; I’m going to load it in again (and standardize the features while I’m at it):

df = pd.read_csv("data/pokemon.csv", usecols=['name', 'defense', 'attack', 'speed', 'capture_rt', 'legendary'])
x = StandardScaler().fit_transform(df.drop(columns=["name", "legendary"]))  # standardize the numeric features
X = np.hstack((np.ones((len(x), 1)), x))  # prepend a column of ones for the intercept
y = df['legendary'].to_numpy()
df.head()
   name        attack  defense  speed  capture_rt  legendary
0  Bulbasaur       49       49     45          45          0
1  Ivysaur         62       63     60          45          0
2  Venusaur       100      123     80          45          0
3  Charmander      52       43     65          45          0
4  Charmeleon      64       58     80          45          0
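  • Before moving on, it’s worth a quick sanity check (assuming the cell above has been run) that the design matrix has an intercept column plus the four features, and that the target really is binary:

print(X.shape, y.shape)                 # X is (n, 5): a column of ones plus the 4 standardized features
print(df["legendary"].value_counts())   # the target takes values 0 (not legendary) and 1 (legendary)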
  • The goal here is to use the features (but not “name”, which is just there for illustration purposes) to predict the target “legendary” (which takes values of 0/No and 1/Yes).

  • So we have 4 features, meaning that our logistic regression model will have 5 parameters to estimate (4 feature coefficients and 1 intercept)

  • At this point let’s define our loss functions:

def sigmoid(w, x):
    """Sigmoid function (i.e., logistic regression predictions)."""
    return 1 / (1 + np.exp(-x @ w))


def mse(w, x, y):
    """Mean squared error."""
    return np.mean((sigmoid(w, x) - y) ** 2)


def logistic_loss(w, x, y):
    """Logistic loss."""
    return -np.mean(y * np.log(sigmoid(w, x)) + (1 - y) * np.log(1 - sigmoid(w, x)))
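  • If you want to double-check that implementation, the hand-rolled log loss can be compared against scikit-learn’s sklearn.metrics.log_loss; this is just an optional check, and the weight vector below is arbitrary:

from sklearn.metrics import log_loss

w_check = np.array([0.5, 1.0, -0.5, 0.5, -2.0])  # arbitrary weights, just for the comparison
print(logistic_loss(w_check, X, y))               # the implementation defined above
print(log_loss(y, sigmoid(w_check, X)))           # scikit-learn's binary cross-entropy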
  • For a moment, let’s assume values for all the parameters except for \(w_1\)

  • We will then calculate both the mean squared error and the log loss for different values of \(w_1\), as in the code below

w1_arr = np.arange(-3, 6.1, 0.1)  # grid of w1 values; all other parameters are held fixed
losses = pd.DataFrame({"w1": w1_arr,
                       "mse": [mse([0.5, w1, -0.5, 0.5, -2], X, y) for w1 in w1_arr],
                       "log": [logistic_loss([0.5, w1, -0.5, 0.5, -2], X, y) for w1 in w1_arr]})
losses.head()
    w1       mse       log
0  -3.0  0.451184  1.604272
1  -2.9  0.446996  1.571701
2  -2.8  0.442773  1.539928
3  -2.7  0.438537  1.508997
4  -2.6  0.434309  1.478955
fig = px.line(losses.melt(id_vars="w1", var_name="loss"), x="w1", y="value", color="loss", facet_col="loss", facet_col_spacing=0.1)
fig.update_yaxes(matches=None, showticklabels=True, col=2)  # give each facet its own y-axis scale
fig.update_xaxes(matches=None, showticklabels=True, col=2)
fig.update_layout(width=800, height=400)
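  • The side-by-side plot is meant to show the difference described above: the MSE loss is non-convex in the parameters, while the log loss is convex. As a rough numerical follow-up (just a sketch using the losses grid built above), we can also check where each curve attains its minimum on the grid:

print(losses.loc[losses["mse"].idxmin(), ["w1", "mse"]])   # grid value of w1 with the smallest MSE
print(losses.loc[losses["log"].idxmin(), ["w1", "log"]])   # grid value of w1 with the smallest log loss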