{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](../docs/banner.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 8: Basic Data Wrangling With Pandas" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Chapter Outline

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chapter Learning Objectives\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Inspect a dataframe with `df.head()`, `df.tail()`, `df.info()`, `df.describe()`.\n", "- Obtain dataframe summaries with `df.info()` and `df.describe()`.\n", "- Manipulate how a dataframe displays in Jupyter by modifying Pandas configuration options such as `pd.set_option(\"display.max_rows\", n)`.\n", "- Rename columns of a dataframe using the `df.rename()` function or by accessing the `df.columns` attribute.\n", "- Modify the index name and index values of a dataframe using `.set_index()`, `.reset_index()` , `df.index.name`, `.index`.\n", "- Use `df.melt()` and `df.pivot()` to reshape dataframes, specifically to make tidy dataframes.\n", "- Combine dataframes using `df.merge()` and `pd.concat()` and know when to use these different methods.\n", "- Apply functions to a dataframe `df.apply()` and `df.applymap()`\n", "- Perform grouping and aggregating operations using `df.groupby()` and `df.agg()`.\n", "- Perform aggregating methods on grouped or ungrouped objects such as finding the minimum, maximum and sum of values in a dataframe using `df.agg()`.\n", "- Remove or fill missing values in a dataframe with `df.dropna()` and `df.fillna()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. DataFrame Characteristics\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last chapter we looked at how we can create dataframes. Let's now look at some helpful ways we can view our dataframe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Head/Tail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.head()` and `.tail()` methods allow you to view the top/bottom *n* (default 5) rows of a dataframe. Let's load in the cycling data set from last chapter and try them out:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/cycling_data.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default return value is 5 rows, but we can pass in any number we like. For example, let's take a look at the top 10 rows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", "5 Rested after the weekend! \n", "6 Legs feeling strong! \n", "7 Raining \n", "8 Raining today \n", "9 Pumped up tires " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the bottom 5 rows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame Summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three very helpful attributes/functions for getting high-level summaries of your dataframe are:\n", "- `.shape`\n", "- `.info()`\n", "- `.describe()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.shape` is just like the ndarray attribute we've seen previously. 
It gives the shape (rows, cols) of your dataframe:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(33, 6)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.info()` prints information about the dataframe itself, such as dtypes, memory usage, non-null counts, etc.:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 33 entries, 0 to 32\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Date 33 non-null object \n", " 1 Name 33 non-null object \n", " 2 Type 33 non-null object \n", " 3 Time 33 non-null int64 \n", " 4 Distance 31 non-null float64\n", " 5 Comments 33 non-null object \n", "dtypes: float64(1), int64(1), object(4)\n", "memory usage: 1.7+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.describe()` provides summary statistics of the values within a dataframe:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeDistance
count33.00000031.000000
mean3512.78787912.667419
std8003.3092330.428618
min1712.00000011.790000
25%1863.00000012.480000
50%2118.00000012.620000
75%2285.00000012.750000
max48062.00000014.570000
\n", "
" ], "text/plain": [ " Time Distance\n", "count 33.000000 31.000000\n", "mean 3512.787879 12.667419\n", "std 8003.309233 0.428618\n", "min 1712.000000 11.790000\n", "25% 1863.000000 12.480000\n", "50% 2118.000000 12.620000\n", "75% 2285.000000 12.750000\n", "max 48062.000000 14.570000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `.describe()` only print summaries of numeric features. We can force it to give summaries on all features using the argument `include='all'` (although they may not make sense!):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
count33333333.00000031.00000033
unique3321NaNNaN25
top20 Sep 2019, 01:02:05Afternoon RideRideNaNNaNRested after the weekend!
freq11733NaNNaN3
meanNaNNaNNaN3512.78787912.667419NaN
stdNaNNaNNaN8003.3092330.428618NaN
minNaNNaNNaN1712.00000011.790000NaN
25%NaNNaNNaN1863.00000012.480000NaN
50%NaNNaNNaN2118.00000012.620000NaN
75%NaNNaNNaN2285.00000012.750000NaN
maxNaNNaNNaN48062.00000014.570000NaN
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "count 33 33 33 33.000000 31.000000 \n", "unique 33 2 1 NaN NaN \n", "top 20 Sep 2019, 01:02:05 Afternoon Ride Ride NaN NaN \n", "freq 1 17 33 NaN NaN \n", "mean NaN NaN NaN 3512.787879 12.667419 \n", "std NaN NaN NaN 8003.309233 0.428618 \n", "min NaN NaN NaN 1712.000000 11.790000 \n", "25% NaN NaN NaN 1863.000000 12.480000 \n", "50% NaN NaN NaN 2118.000000 12.620000 \n", "75% NaN NaN NaN 2285.000000 12.750000 \n", "max NaN NaN NaN 48062.000000 14.570000 \n", "\n", " Comments \n", "count 33 \n", "unique 25 \n", "top Rested after the weekend! \n", "freq 3 \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Displaying DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Displaying your dataframes effectively can be an important part of your workflow. If a dataframe has more than 60 rows, Pandas will only display the first 5 and last 5 rows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
00.643224
10.617756
20.650490
30.289595
40.469394
......
950.549403
960.187836
970.016904
980.733392
990.875343
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " 0\n", "0 0.643224\n", "1 0.617756\n", "2 0.650490\n", "3 0.289595\n", "4 0.469394\n", ".. ...\n", "95 0.549403\n", "96 0.187836\n", "97 0.016904\n", "98 0.733392\n", "99 0.875343\n", "\n", "[100 rows x 1 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.random.rand(100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For dataframes of less than 60 rows, Pandas will print the whole dataframe:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
1119 Sep 2019, 13:52:09Morning RideRide209012.59Getting colder which is nice
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
1323 Sep 2019, 13:50:41Morning RideRide246212.68Rested after the weekend!
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
1524 Sep 2019, 13:41:24Morning RideRide232112.68Bike feeling much smoother
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
1725 Sep 2019, 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
1926 Sep 2019, 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
2130 Sep 2019, 13:53:52Morning RideRide211812.71Rested after the weekend!
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
231 Oct 2019, 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "11 19 Sep 2019, 13:52:09 Morning Ride Ride 2090 12.59 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "13 23 Sep 2019, 13:50:41 Morning Ride Ride 2462 12.68 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "15 24 Sep 2019, 13:41:24 Morning Ride Ride 2321 12.68 \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "17 25 Sep 2019, 13:35:41 Morning Ride Ride 2124 12.65 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 \n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "21 30 Sep 2019, 13:53:52 Morning Ride Ride 2118 12.71 \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "23 1 Oct 2019, 13:45:55 Morning Ride Ride 2222 12.82 \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 
Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", "5 Rested after the weekend! \n", "6 Legs feeling strong! \n", "7 Raining \n", "8 Raining today \n", "9 Pumped up tires \n", "10 Feeling good \n", "11 Getting colder which is nice \n", "12 Feeling good \n", "13 Rested after the weekend! \n", "14 Oiled chain, bike feels smooth \n", "15 Bike feeling much smoother \n", "16 Feeling really tired \n", "17 Stopped for photo of sunrise \n", "18 raining \n", "19 Detour around trucks at Jericho \n", "20 Tired by the end of the week \n", "21 Rested after the weekend! \n", "22 Legs feeling strong! \n", "23 Beautiful morning! Feeling fit \n", "24 A little tired today but good weather \n", "25 Bit tired today but good weather \n", "26 Feeling good \n", "27 Wet road \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I find the 60 row threshold to be a little too much; I prefer something more like 20. You can change the setting using `pd.set_option(\"display.max_rows\", 20)` so that anything with more than 20 rows will be summarised by the first and last 5 rows as before:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option(\"display.max_rows\", 20)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are also other display options you can change, such as how many columns are shown, how numbers are formatted, etc. See the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#options-and-settings) for more.\n", "\n", "One display option I will point out is that Pandas allows you to style your tables, for example by highlighting negative values, or adding conditional colour maps to your dataframe. Below I'll style values based on their value ranging from negative (purple) to postive (yellow) but you can see the [styling documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styling) for more examples." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature_0 feature_1 feature_2 feature_3 feature_4
row_0-0.519284-0.553179-1.7688101.023342-0.668625
row_1-1.0645820.2055380.350555-1.9427611.594635
row_21.0578860.074919-0.998969-0.0307250.677628
row_30.898228-0.697138-0.134935-1.592920-2.127225
row_40.0395820.372877-0.299539-2.1329722.194797
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = pd.DataFrame(np.random.randn(5, 5),\n", " index = [f\"row_{_}\" for _ in range(5)],\n", " columns = [f\"feature_{_}\" for _ in range(5)])\n", "test.style.background_gradient(cmap='plasma')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Views vs Copies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In previous chapters we've discussed views (\"looking\" at a part of an existing object) and copies (making a new copy of the object in memory). These things get a little abstract with Pandas and \"...it’s very hard to predict whether it will return a view or a copy\" (that's a quote straight [from a dedicated section in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)).\n", "\n", "Basically, it depends on the operation you are trying to perform, your dataframe's structure and the memory layout of the underlying array. But don't worry, let me tell you all you need to know. Firstly, the most common warning you'll encounter in Pandas is the `SettingWithCopy`, Pandas raises it as a warning that you might not be doing what you think you're doing. Let's see an example. You may recall there is one outlier `Time` in our dataframe:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
\n", "
" ], "text/plain": [ " Date Name Type Time Distance Comments\n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 Feeling good" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Time'] > 4000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine we wanted to change this to `2000`. You'd probably do the following:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] } ], "source": [ "df[df['Time'] > 4000]['Time'] = 2000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah, there's that warning. Did our dataframe get changed?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
\n", "
" ], "text/plain": [ " Date Name Type Time Distance Comments\n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 Feeling good" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Time'] > 4000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No it didn't, even though you probably thought it did. What happened above is that `df[df['Time'] > 4000]` was executed first and returned a copy of the dataframe, we can confirm by using `id()`:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The id of the original dataframe is: 5762156560\n", " The id of the indexed dataframe is: 5781171152\n" ] } ], "source": [ "print(f\"The id of the original dataframe is: {id(df)}\")\n", "print(f\" The id of the indexed dataframe is: {id(df[df['Time'] > 4000])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then tried to set a value on this new object by appending `['Time'] = 2000`. Pandas is warning us that we are doing that operation on a copy of the original dataframe, which is probably not what we want. To fix this, you need to index in a single go, using `.loc[]` for example:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df.loc[df['Time'] > 4000, 'Time'] = 2000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No error this time! And let's confirm the change:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [Date, Name, Type, Time, Distance, Comments]\n", "Index: []" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Time'] > 4000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second thing you need to know is that if you're ever in doubt about whether something is a view or a copy, you can just use the `.copy()` method to force a copy of a dataframe. Just like this:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df2 = df[df['Time'] > 4000].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That way, your guaranteed a copy that you can modify as you wish." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Basic DataFrame Manipulations\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can rename columns two ways:\n", "1. Using `.rename()` (to selectively change column names)\n", "2. By setting the `.columns` attribute (to change all column names at once)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's give it a go:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={\"Date\": \"Datetime\",\n", " \"Comments\": \"Notes\"})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait? What happened? Nothing changed? In the code above we did actually rename columns of our dataframe but we didn't modify the dataframe inplace, we made a copy of it. There are generally two options for making permanent dataframe changes:\n", "- 1. Use the argument `inplace=True`, e.g., `df.rename(..., inplace=True)`, available in most functions/methods\n", "- 2. Re-assign, e.g., `df = df.rename(...)`\n", "The Pandas team recommends **Method 2 (re-assign)**, for a [few reasons](https://www.youtube.com/watch?v=hK6o_TDXXN8&t=700) (mostly to do with how memory is allocated under the hood)." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DatetimeNameTypeTimeDistanceNotes
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Datetime Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Notes \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.rename(columns={\"Date\": \"Datetime\",\n", " \"Comments\": \"Notes\"})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wish to change all of the columns of a dataframe, you can do so by setting the `.columns` attribute:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column 1Column 2Column 3Column 4Column 5Column 6
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Column 1 Column 2 Column 3 Column 4 Column 5 \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Column 6 \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns = [f\"Column {_}\" for _ in range(1, 7)]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can change the index labels of a dataframe in 3 main ways:\n", "1. `.set_index()` to make one of the columns of the dataframe the index\n", "2. Directly modify `df.index.name` to change the index name\n", "3. `.reset_index()` to move the current index as a column and to reset the index with integer labels starting from 0\n", "4. Directly modify the `.index()` attribute" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column 1Column 2Column 3Column 4Column 5Column 6
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Column 1 Column 2 Column 3 Column 4 Column 5 \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Column 6 \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below I will set the index as `Column 1` and rename the index to \"New Index\":" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Column 2Column 3Column 4Column 5Column 6
New Index
10 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
10 Sep 2019, 13:52:18Morning RideRide253113.03rain
11 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
11 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
12 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
..................
4 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
9 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
10 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
10 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
11 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 5 columns

\n", "
" ], "text/plain": [ " Column 2 Column 3 Column 4 Column 5 \\\n", "New Index \n", "10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "... ... ... ... ... \n", "4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Column 6 \n", "New Index \n", "10 Sep 2019, 00:13:04 Rain \n", "10 Sep 2019, 13:52:18 rain \n", "11 Sep 2019, 00:23:50 Wet road but nice weather \n", "11 Sep 2019, 14:06:19 Stopped for photo of sunrise \n", "12 Sep 2019, 00:28:05 Tired by the end of the week \n", "... ... \n", "4 Oct 2019, 01:08:08 Very tired, riding into the wind \n", "9 Oct 2019, 13:55:40 Really cold! But feeling good \n", "10 Oct 2019, 00:10:31 Feeling good after a holiday break! \n", "10 Oct 2019, 13:47:14 Stopped for photo of sunrise \n", "11 Oct 2019, 00:16:57 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 5 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index(\"Column 1\")\n", "df.index.name = \"New Index\"\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I can send the index back to a column and have a default integer index using `.reset_index()`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
New IndexColumn 2Column 3Column 4Column 5Column 6
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " New Index Column 2 Column 3 Column 4 Column 5 \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Column 6 \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.reset_index()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like with column names, we can also modify the index directly, but I can't remember ever doing this, usually I'll use `.set_index()`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=33, step=1)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
New IndexColumn 2Column 3Column 4Column 5Column 6
10010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
10110 Sep 2019, 13:52:18Morning RideRide253113.03rain
10211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
10311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
10412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
1284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
1299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
13010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
13110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
13211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " New Index Column 2 Column 3 Column 4 Column 5 \\\n", "100 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "101 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "102 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "103 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "104 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "128 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "129 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "130 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "131 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "132 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Column 6 \n", "100 Rain \n", "101 rain \n", "102 Wet road but nice weather \n", "103 Stopped for photo of sunrise \n", "104 Tired by the end of the week \n", ".. ... \n", "128 Very tired, riding into the wind \n", "129 Really cold! But feeling good \n", "130 Feeling good after a holiday break! \n", "131 Stopped for photo of sunrise \n", "132 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index = range(100, 133, 1)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding/Removing Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two main ways to add/remove columns of a dataframe:\n", "1. Use `[]` to add columns\n", "2. Use `.drop()` to drop columns\n", "\n", "Let's re-read in a fresh copy of the cycling dataset." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/cycling_data.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add a new column to a dataframe by simply using `[]` with a new column name and value(s):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceCommentsRiderAvg Speed
010 Sep 2019, 00:13:04Afternoon RideRide208412.62RainTom Beuzen6.055662
110 Sep 2019, 13:52:18Morning RideRide253113.03rainTom Beuzen5.148163
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weatherTom Beuzen6.720344
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunriseTom Beuzen5.857664
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the weekTom Beuzen6.599683
...........................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the windTom Beuzen6.754011
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling goodTom Beuzen5.909725
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!Tom Beuzen6.838675
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunriseTom Beuzen5.192854
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pumpTom Beuzen6.397179
\n", "

33 rows × 8 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments Rider Avg Speed \n", "0 Rain Tom Beuzen 6.055662 \n", "1 rain Tom Beuzen 5.148163 \n", "2 Wet road but nice weather Tom Beuzen 6.720344 \n", "3 Stopped for photo of sunrise Tom Beuzen 5.857664 \n", "4 Tired by the end of the week Tom Beuzen 6.599683 \n", ".. ... ... ... \n", "28 Very tired, riding into the wind Tom Beuzen 6.754011 \n", "29 Really cold! But feeling good Tom Beuzen 5.909725 \n", "30 Feeling good after a holiday break! Tom Beuzen 6.838675 \n", "31 Stopped for photo of sunrise Tom Beuzen 5.192854 \n", "32 Bike feeling tight, needs an oil and pump Tom Beuzen 6.397179 \n", "\n", "[33 rows x 8 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Rider'] = 'Tom Beuzen'\n", "df['Avg Speed'] = df['Distance'] * 1000 / df['Time'] # avg. speed in m/s\n", "df" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.drop(columns=['Rider', 'Avg Speed'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding/Removing Rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You won't often be adding rows to a dataframe manually (you'll usually add rows through concatenating/joining - that's coming up next). You can add/remove rows of a dataframe in two ways:\n", "1. Use `.append()` to add rows\n", "2. Use `.drop()` to drop rows" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add a new row to the bottom of this dataframe:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
3312 Oct 2019, 00:10:57Morning RideRide233112.67Washed and oiled bike last night
\n", "

34 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "33 12 Oct 2019, 00:10:57 Morning Ride Ride 2331 12.67 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "33 Washed and oiled bike last night \n", "\n", "[34 rows x 6 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "another_row = pd.DataFrame([[\"12 Oct 2019, 00:10:57\", \"Morning Ride\", \"Ride\",\n", " 2331, 12.67, \"Washed and oiled bike last night\"]],\n", " columns = df.columns,\n", " index = [33])\n", "df = df.append(another_row)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can drop all rows above index 30 using `.drop()`:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
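A note on `.append()`: in newer versions of pandas, `DataFrame.append()` was deprecated (and later removed), so the same row addition is usually written with `pd.concat()` instead. Here's a minimal sketch with made-up values (the column names and numbers are illustrative, not from the cycling data):

```python
import pandas as pd

df = pd.DataFrame({"Time": [2084, 2531], "Distance": [12.62, 13.03]})

# Wrap the new row in a one-row DataFrame so pd.concat can stack it on axis 0
another_row = pd.DataFrame([[2331, 12.67]],
                           columns=df.columns,
                           index=[2])
df = pd.concat([df, another_row])
```

The idea is identical to `.append()`: both the existing dataframe and the new row are dataframes, and `pd.concat()` stacks them vertically, preserving the index you supply.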
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
\n", "

30 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "25 Bit tired today but good weather \n", "26 Feeling good \n", "27 Wet road \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "\n", "[30 rows x 6 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(index=range(30, 34))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. DataFrame Reshaping\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) is about \"linking the structure of a dataset with its semantics (its meaning)\". It is defined by:\n", "1. Each variable forms a column\n", "2. Each observation forms a row\n", "3. Each type of observational unit forms a table\n", "\n", "Often you'll need to reshape a dataframe to make it tidy (or for some other purpose).\n", " \n", "![](img/chapter8/tidy.png)\n", "\n", "Source: [r4ds](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Melt and Pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas `.melt()`, `.pivot()` and `.pivot_table()` can help reshape dataframes:\n", "- `.melt()`: make wide data long.\n", "- `.pivot()`: make long data wide.\n", "- `.pivot_table()`: same as `.pivot()` but can handle multiple indexes.\n", " \n", "![](img/chapter8/melt_pivot.gif)\n", "\n", "Source: [Garrick Aden-Buie's GitHub](https://github.com/gadenbuie/tidyexplain#spread-and-gather)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data below shows how many courses different instructors taught across different years. If the question you want to answer is something like \"Does the number of courses taught vary depending on year?\", then the data below would probably not be considered tidy, because each row contains multiple observations of courses taught in a year (i.e., there is data for 2018, 2019 and 2020 in a single row):" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name201820192020
0Tom125
1Mike342
2Tiffany434
3Varada524
4Joel313
\n", "
" ], "text/plain": [ "      Name  2018  2019  2020\n", "0      Tom     1     2     5\n", "1     Mike     3     4     2\n", "2  Tiffany     4     3     4\n", "3   Varada     5     2     4\n", "4     Joel     3     1     3" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"Name\": [\"Tom\", \"Mike\", \"Tiffany\", \"Varada\", \"Joel\"],\n", "                   \"2018\": [1, 3, 4, 5, 3],\n", "                   \"2019\": [2, 4, 3, 2, 1],\n", "                   \"2020\": [5, 2, 4, 4, 3]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make it tidy with `.melt()`. `.melt()` takes a few arguments; the most important is `id_vars`, which indicates which column(s) should be the \"identifier\"." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameYearCourses
0Tom20181
1Mike20183
2Tiffany20184
3Varada20185
4Joel20183
5Tom20192
6Mike20194
7Tiffany20193
8Varada20192
9Joel20191
10Tom20205
11Mike20202
12Tiffany20204
13Varada20204
14Joel20203
\n", "
" ], "text/plain": [ " Name Year Courses\n", "0 Tom 2018 1\n", "1 Mike 2018 3\n", "2 Tiffany 2018 4\n", "3 Varada 2018 5\n", "4 Joel 2018 3\n", "5 Tom 2019 2\n", "6 Mike 2019 4\n", "7 Tiffany 2019 3\n", "8 Varada 2019 2\n", "9 Joel 2019 1\n", "10 Tom 2020 5\n", "11 Mike 2020 2\n", "12 Tiffany 2020 4\n", "13 Varada 2020 4\n", "14 Joel 2020 3" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_melt = df.melt(id_vars=\"Name\",\n", " var_name=\"Year\",\n", " value_name=\"Courses\")\n", "df_melt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `value_vars` argument allows us to select which specific variables we want to \"melt\" (if you don't specify `value_vars`, all non-identifier columns will be used). For example, below I'm omitting the `2018` column:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
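As a quick sanity check on what `.melt()` did above: one identifier column plus three value columns should yield (number of rows) × 3 long-format observations. A small sketch with toy numbers (not the instructor data):

```python
import pandas as pd

# Toy wide table: one identifier column and three "year" value columns
wide = pd.DataFrame({"Name": ["Tom", "Mike"],
                     "2018": [1, 3],
                     "2019": [2, 4],
                     "2020": [5, 2]})

# Each year column contributes one long-format row per person
long = wide.melt(id_vars="Name", var_name="Year", value_name="Courses")
print(long.shape)  # (6, 3): 2 people x 3 years, columns Name/Year/Courses
```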
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameYearCourses
0Tom20192
1Mike20194
2Tiffany20193
3Varada20192
4Joel20191
5Tom20205
6Mike20202
7Tiffany20204
8Varada20204
9Joel20203
\n", "
" ], "text/plain": [ " Name Year Courses\n", "0 Tom 2019 2\n", "1 Mike 2019 4\n", "2 Tiffany 2019 3\n", "3 Varada 2019 2\n", "4 Joel 2019 1\n", "5 Tom 2020 5\n", "6 Mike 2020 2\n", "7 Tiffany 2020 4\n", "8 Varada 2020 4\n", "9 Joel 2020 3" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt(id_vars=\"Name\",\n", " value_vars=[\"2019\", \"2020\"],\n", " var_name=\"Year\",\n", " value_name=\"Courses\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, you want to make long data wide, which we can do with `.pivot()`. When using `.pivot()` we need to specify the `index` to pivot on, and the `columns` that will be used to make the new columns of the wider dataframe:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Year201820192020
Name
Joel313
Mike342
Tiffany434
Tom125
Varada524
\n", "
" ], "text/plain": [ "Year 2018 2019 2020\n", "Name \n", "Joel 3 1 3\n", "Mike 3 4 2\n", "Tiffany 4 3 4\n", "Tom 1 2 5\n", "Varada 5 2 4" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pivot = df_melt.pivot(index=\"Name\",\n", " columns=\"Year\",\n", " values=\"Courses\")\n", "df_pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that Pandas set our specified `index` as the index of the new dataframe and preserved the label of the columns. We can easily remove these names and reset the index to make our dataframe look like it originally did:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
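Put together, `.melt()` and `.pivot()` (plus the `reset_index()`/`columns.name` clean-up above) form a round trip: melting and then pivoting recovers the original wide dataframe. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Mike"],
                   "2018": [1, 3],
                   "2019": [2, 4]})

# Wide -> long -> wide again
long = df.melt(id_vars="Name", var_name="Year", value_name="Courses")
wide = long.pivot(index="Name", columns="Year", values="Courses").reset_index()
wide.columns.name = None  # drop the leftover "Year" columns label

# After aligning row order, we are back where we started
assert wide.sort_values("Name").reset_index(drop=True).equals(
    df.sort_values("Name").reset_index(drop=True))
```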
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name201820192020
0Joel313
1Mike342
2Tiffany434
3Tom125
4Varada524
\n", "
" ], "text/plain": [ " Name 2018 2019 2020\n", "0 Joel 3 1 3\n", "1 Mike 3 4 2\n", "2 Tiffany 4 3 4\n", "3 Tom 1 2 5\n", "4 Varada 5 2 4" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pivot = df_pivot.reset_index()\n", "df_pivot.columns.name = None\n", "df_pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.pivot()` will often get you what you want, but it won't work if you want to:\n", "- Use multiple indexes (next chapter), or\n", "- Have duplicate index/column labels\n", "\n", "In these cases you'll have to use `.pivot_table()`. I won't focus on it too much here because I'd rather you learn about `pivot()` first." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameDepartmentYearCourses
0TomCS20181
1TomSTATS20182
2MikeCS20183
3MikeSTATS20181
4TomCS20192
5TomSTATS20193
6MikeCS20194
7MikeSTATS20192
8TomCS20205
9TomSTATS20201
10MikeCS20202
11MikeSTATS20202
\n", "
" ], "text/plain": [ " Name Department Year Courses\n", "0 Tom CS 2018 1\n", "1 Tom STATS 2018 2\n", "2 Mike CS 2018 3\n", "3 Mike STATS 2018 1\n", "4 Tom CS 2019 2\n", "5 Tom STATS 2019 3\n", "6 Mike CS 2019 4\n", "7 Mike STATS 2019 2\n", "8 Tom CS 2020 5\n", "9 Tom STATS 2020 1\n", "10 Mike CS 2020 2\n", "11 Mike STATS 2020 2" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"Name\": [\"Tom\", \"Tom\", \"Mike\", \"Mike\"],\n", " \"Department\": [\"CS\", \"STATS\", \"CS\", \"STATS\"],\n", " \"2018\": [1, 2, 3, 1],\n", " \"2019\": [2, 3, 4, 2],\n", " \"2020\": [5, 1, 2, 2]}).melt(id_vars=[\"Name\", \"Department\"], var_name=\"Year\", value_name=\"Courses\")\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above case, we have duplicates in `Name`, so `pivot()` won't work. It will throw us a `ValueError: Index contains duplicate entries, cannot reshape`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "ValueError", "evalue": "Index contains duplicate entries, cannot reshape", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m df.pivot(index=\"Name\",\n\u001b[1;32m 2\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"Year\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m values=\"Courses\")\n\u001b[0m", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36mpivot\u001b[0;34m(self, index, columns, values)\u001b[0m\n\u001b[1;32m 6669\u001b[0m \u001b[0;32mfrom\u001b[0m 
\u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpivot\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mpivot\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6670\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 6671\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mpivot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6672\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6673\u001b[0m _shared_docs[\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/pivot.py\u001b[0m in \u001b[0;36mpivot\u001b[0;34m(data, index, columns, values)\u001b[0m\n\u001b[1;32m 475\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 476\u001b[0m \u001b[0mindexed\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor_sliced\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 477\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mindexed\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 478\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 479\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36munstack\u001b[0;34m(self, level, fill_value)\u001b[0m\n\u001b[1;32m 3888\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0munstack\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3889\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3890\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0munstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfill_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3891\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3892\u001b[0m \u001b[0;31m# ----------------------------------------------------------------------\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36munstack\u001b[0;34m(obj, level, fill_value)\u001b[0m\n\u001b[1;32m 423\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0m_unstack_extension_series\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfill_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 424\u001b[0m unstacker = _Unstacker(\n\u001b[0;32m--> 425\u001b[0;31m \u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mconstructor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor_expanddim\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 426\u001b[0m )\n\u001b[1;32m 427\u001b[0m return unstacker.get_result(\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, index, level, constructor)\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Unstacked DataFrame is too big, causing int32 overflow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 119\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 120\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_selectors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 121\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 122\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mcache_readonly\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36m_make_selectors\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 167\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 168\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 169\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Index contains duplicate entries, cannot 
reshape\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 170\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 171\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcomp_index\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Index contains duplicate entries, cannot reshape" ] } ], "source": [ "df.pivot(index=\"Name\",\n", " columns=\"Year\",\n", " values=\"Courses\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In such a case, we'd use `.pivot_table()`. It will apply an aggregation function to our duplicates, in this case, we'll `sum()` them up:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Year201820192020
Name
Mike464
Tom356
\n", "
" ], "text/plain": [ "Year 2018 2019 2020\n", "Name \n", "Mike 4 6 4\n", "Tom 3 5 6" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index=\"Name\", columns='Year', values='Courses', aggfunc='sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to keep the numbers per department, we could specify both `Name` and `Department` as multiple indexes:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Year201820192020
NameDepartment
MikeCS342
STATS122
TomCS125
STATS231
\n", "
" ], "text/plain": [ "Year 2018 2019 2020\n", "Name Department \n", "Mike CS 3 4 2\n", " STATS 1 2 2\n", "Tom CS 1 2 5\n", " STATS 2 3 1" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index=[\"Name\", \"Department\"], columns='Year', values='Courses')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result above is a mutlti-index or \"hierarchically indexed\" dataframe (more on those next chapter). If you ever have a need to use it, you can read more about `pivot_table()` in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#pivot-tables)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Working with Multiple DataFrames\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often you'll work with multiple dataframes that you want to stick together or merge. `df.merge()` and `df.concat()` are all you need to know for combining dataframes. The Pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) is very helpful for these functions, but they are pretty easy to grasp.\n", "\n", "```{note}\n", "The example joins shown in this section are inspired by [Chapter 15](https://stat545.com/join-cheatsheet.html) of Jenny Bryan's STAT 545 materials.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sticking DataFrames Together with `pd.concat()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use `pd.concat()` to stick dataframes together:\n", "- Vertically: if they have the same **columns**, OR\n", "- Horizontally: if they have the same **rows**" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'A': [1, 3, 5],\n", " 'B': [2, 4, 6]})\n", "df2 = pd.DataFrame({'A': [7, 9, 11],\n", " 'B': [8, 10, 12]})" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
078
1910
21112
\n", "
" ], "text/plain": [ " A B\n", "0 7 8\n", "1 9 10\n", "2 11 12" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
078
1910
21112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "0 7 8\n", "1 9 10\n", "2 11 12" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=0) # axis=0 specifies a vertical stick, i.e., on the columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the indexes were simply joined together? This may or may not be what you want. To reset the index, you can specify the argument `ignore_index=True`:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
378
4910
51112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "3 7 8\n", "4 9 10\n", "5 11 12" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=0, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `axis=1` to stick together horizontally:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
01278
134910
2561112
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "0 1 2 7 8\n", "1 3 4 9 10\n", "2 5 6 11 12" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2), axis=1, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are not limited to just two dataframes, you can concatenate as many as you want:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AB
012
134
256
378
4910
51112
612
734
856
978
10910
111112
\n", "
" ], "text/plain": [ " A B\n", "0 1 2\n", "1 3 4\n", "2 5 6\n", "3 7 8\n", "4 9 10\n", "5 11 12\n", "6 1 2\n", "7 3 4\n", "8 5 6\n", "9 7 8\n", "10 9 10\n", "11 11 12" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat((df1, df2, df1, df2), axis=0, ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joining DataFrames with `pd.merge()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pd.merge()` gives you the ability to \"join\" dataframes using different rules (just like with SQL if you're familiar with it). You can use `df.merge()` to join dataframes based on shared `key` columns. Methods include:\n", "- \"inner join\"\n", "- \"outer join\"\n", "- \"left join\"\n", "- \"right join\"\n", "\n", "See this great [cheat sheet](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join) and [these great animations](https://github.com/gadenbuie/tidyexplain) for more insights." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({\"name\": ['Magneto', 'Storm', 'Mystique', 'Batman', 'Joker', 'Catwoman', 'Hellboy'],\n", " 'alignment': ['bad', 'good', 'bad', 'good', 'bad', 'bad', 'good'],\n", " 'gender': ['male', 'female', 'female', 'male', 'male', 'female', 'male'],\n", " 'publisher': ['Marvel', 'Marvel', 'Marvel', 'DC', 'DC', 'DC', 'Dark Horse Comics']})\n", "df2 = pd.DataFrame({'publisher': ['DC', 'Marvel', 'Image'],\n", " 'year_founded': [1934, 1939, 1992]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/chapter8/join.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"inner\" join will return all rows of `df1` where matching values for \"publisher\" are found in `df2`:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939
1StormgoodfemaleMarvel1939
2MystiquebadfemaleMarvel1939
3BatmangoodmaleDC1934
4JokerbadmaleDC1934
5CatwomanbadfemaleDC1934
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939\n", "1 Storm good female Marvel 1939\n", "2 Mystique bad female Marvel 1939\n", "3 Batman good male DC 1934\n", "4 Joker bad male DC 1934\n", "5 Catwoman bad female DC 1934" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"inner\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/chapter8/inner_join.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An \"outer\" join will return all rows of `df1` and `df2`, placing NaNs where information is unavailable:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939.0
1StormgoodfemaleMarvel1939.0
2MystiquebadfemaleMarvel1939.0
3BatmangoodmaleDC1934.0
4JokerbadmaleDC1934.0
5CatwomanbadfemaleDC1934.0
6HellboygoodmaleDark Horse ComicsNaN
7NaNNaNNaNImage1992.0
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939.0\n", "1 Storm good female Marvel 1939.0\n", "2 Mystique bad female Marvel 1939.0\n", "3 Batman good male DC 1934.0\n", "4 Joker bad male DC 1934.0\n", "5 Catwoman bad female DC 1934.0\n", "6 Hellboy good male Dark Horse Comics NaN\n", "7 NaN NaN NaN Image 1992.0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"outer\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/chapter8/outer_join.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Return all rows from `df1` and all columns of `df1` and `df2`, populated where matches occur:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0MagnetobadmaleMarvel1939.0
1StormgoodfemaleMarvel1939.0
2MystiquebadfemaleMarvel1939.0
3BatmangoodmaleDC1934.0
4JokerbadmaleDC1934.0
5CatwomanbadfemaleDC1934.0
6HellboygoodmaleDark Horse ComicsNaN
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Magneto bad male Marvel 1939.0\n", "1 Storm good female Marvel 1939.0\n", "2 Mystique bad female Marvel 1939.0\n", "3 Batman good male DC 1934.0\n", "4 Joker bad male DC 1934.0\n", "5 Catwoman bad female DC 1934.0\n", "6 Hellboy good male Dark Horse Comics NaN" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"left\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/chapter8/left_join.png)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded
0BatmangoodmaleDC1934
1JokerbadmaleDC1934
2CatwomanbadfemaleDC1934
3MagnetobadmaleMarvel1939
4StormgoodfemaleMarvel1939
5MystiquebadfemaleMarvel1939
6NaNNaNNaNImage1992
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded\n", "0 Batman good male DC 1934\n", "1 Joker bad male DC 1934\n", "2 Catwoman bad female DC 1934\n", "3 Magneto bad male Marvel 1939\n", "4 Storm good female Marvel 1939\n", "5 Mystique bad female Marvel 1939\n", "6 NaN NaN NaN Image 1992" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"right\", on=\"publisher\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to specify the `key` to join dataframes on, you can join on index values, different, column names, etc. Another helpful argument is the `indicator` argument which will add a column to the result telling you where matches were found in the dataframes:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namealignmentgenderpublisheryear_founded_merge
0MagnetobadmaleMarvel1939.0both
1StormgoodfemaleMarvel1939.0both
2MystiquebadfemaleMarvel1939.0both
3BatmangoodmaleDC1934.0both
4JokerbadmaleDC1934.0both
5CatwomanbadfemaleDC1934.0both
6HellboygoodmaleDark Horse ComicsNaNleft_only
7NaNNaNNaNImage1992.0right_only
\n", "
" ], "text/plain": [ " name alignment gender publisher year_founded _merge\n", "0 Magneto bad male Marvel 1939.0 both\n", "1 Storm good female Marvel 1939.0 both\n", "2 Mystique bad female Marvel 1939.0 both\n", "3 Batman good male DC 1934.0 both\n", "4 Joker bad male DC 1934.0 both\n", "5 Catwoman bad female DC 1934.0 both\n", "6 Hellboy good male Dark Horse Comics NaN left_only\n", "7 NaN NaN NaN Image 1992.0 right_only" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df1, df2, how=\"outer\", on=\"publisher\", indicator=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way, you can use `pd.concat()` to do a simple \"inner\" or \"outer\" join on multiple datadrames at once. It's less flexible than merge, but can be useful sometimes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. More DataFrame Operations\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying Custom Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There will be times when you want to apply a function that is not built-in to Pandas. For this, we also have methods:\n", "- `df.apply()`, applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)\n", "- `df.applymap()`, applies a function element-wise (for functions that accept/return single values at a time)\n", "- `series.apply()`/`series.map()`, same as above but for Pandas series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, say you want to use a numpy function on a column in your dataframe:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeDistance
0-0.9018660.053604
1-0.9016970.447197
2-0.035549-0.046354
3-0.7390590.270228
4-0.236515-0.086263
.........
28-0.6833720.063586
290.1500560.133232
300.0267020.023627
31-0.0086400.221770
320.897861-0.700695
\n", "

33 rows × 2 columns

\n", "
" ], "text/plain": [ " Time Distance\n", "0 -0.901866 0.053604\n", "1 -0.901697 0.447197\n", "2 -0.035549 -0.046354\n", "3 -0.739059 0.270228\n", "4 -0.236515 -0.086263\n", ".. ... ...\n", "28 -0.683372 0.063586\n", "29 0.150056 0.133232\n", "30 0.026702 0.023627\n", "31 -0.008640 0.221770\n", "32 0.897861 -0.700695\n", "\n", "[33 rows x 2 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/cycling_data.csv')\n", "df[['Time', 'Distance']].apply(np.sin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you may want to apply your own custom function:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time
00.578889
10.703056
20.517500
30.608889
40.525278
......
280.519444
290.596944
300.511389
310.684167
320.511944
\n", "

33 rows × 1 columns

\n", "
" ], "text/plain": [ " Time\n", "0 0.578889\n", "1 0.703056\n", "2 0.517500\n", "3 0.608889\n", "4 0.525278\n", ".. ...\n", "28 0.519444\n", "29 0.596944\n", "30 0.511389\n", "31 0.684167\n", "32 0.511944\n", "\n", "[33 rows x 1 columns]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def seconds_to_hours(x):\n", " return x / 3600\n", "\n", "df[['Time']].apply(seconds_to_hours)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This may have been better as a lambda function..." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time
00.578889
10.703056
20.517500
30.608889
40.525278
......
280.519444
290.596944
300.511389
310.684167
320.511944
\n", "

33 rows × 1 columns

\n", "
" ], "text/plain": [ " Time\n", "0 0.578889\n", "1 0.703056\n", "2 0.517500\n", "3 0.608889\n", "4 0.525278\n", ".. ...\n", "28 0.519444\n", "29 0.596944\n", "30 0.511389\n", "31 0.684167\n", "32 0.511944\n", "\n", "[33 rows x 1 columns]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Time']].apply(lambda x: x / 3600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even use functions that require additional arguments. Just specify the arguments in `.apply()`:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time
034.733333
142.183333
231.050000
336.533333
431.516667
......
2831.166667
2935.816667
3030.683333
3141.050000
3230.716667
\n", "

33 rows × 1 columns

\n", "
" ], "text/plain": [ " Time\n", "0 34.733333\n", "1 42.183333\n", "2 31.050000\n", "3 36.533333\n", "4 31.516667\n", ".. ...\n", "28 31.166667\n", "29 35.816667\n", "30 30.683333\n", "31 41.050000\n", "32 30.716667\n", "\n", "[33 rows x 1 columns]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def convert_seconds(x, to=\"hours\"):\n", " if to == \"hours\":\n", " return x / 3600\n", " elif to == \"minutes\":\n", " return x / 60\n", "\n", "df[['Time']].apply(convert_seconds, to=\"minutes\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some functions only accept/return a scalar:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int(3.141)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "TypeError", "evalue": "float() argument must be a string or a number, not 'list'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m3.141\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m10.345\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: float() argument must be a string or a number, not 'list'" ] } ], "source": [ "float([3.141, 10.345])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For these, we need `.applymap()`:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Time
02084
12531
21863
32192
41891
......
281870
292149
301841
312463
321843
\n", "

33 rows × 1 columns

\n", "
" ], "text/plain": [ " Time\n", "0 2084\n", "1 2531\n", "2 1863\n", "3 2192\n", "4 1891\n", ".. ...\n", "28 1870\n", "29 2149\n", "30 1841\n", "31 2463\n", "32 1843\n", "\n", "[33 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Time']].applymap(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, there are often \"vectorized\" versions of common functions like this already available, which are much faster. In the case above, we can use `.astype()` to change the dtype of a whole column quickly:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'astype' is 1.98 faster than 'applymap'!\n" ] } ], "source": [ "time_applymap = %timeit -q -o -r 3 df[['Time']].applymap(float)\n", "time_builtin = %timeit -q -o -r 3 df[['Time']].astype(float)\n", "print(f\"'astype' is {time_applymap.average / time_builtin.average:.2f} faster than 'applymap'!\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "raises-exception" ] }, "source": [ "### Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often we are interested in examining specific groups in our data. `df.groupby()` allows us to group our data based on a variable(s)." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/cycling_data.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's group this dataframe on the column `Name`:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfg = df.groupby(by='Name')\n", "dfg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is a `DataFrameGroupBy` object? 
It contains information about the groups of the dataframe:\n", "\n", "![](img/chapter8/groupby_1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Afternoon Ride': [0, 2, 4, 6, 9, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32], 'Morning Ride': [1, 3, 5, 7, 8, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31]}" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfg.groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also access a group using the `.get_group()` method:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "2 Wet road but nice weather \n", "4 Tired by the end of the week \n", "6 Legs feeling strong! \n", "9 Pumped up tires \n", "10 Feeling good \n", "12 Feeling good \n", "14 Oiled chain, bike feels smooth \n", "16 Feeling really tired \n", "18 raining \n", "20 Tired by the end of the week \n", "22 Legs feeling strong! \n", "24 A little tired today but good weather \n", "26 Feeling good \n", "28 Very tired, riding into the wind \n", "30 Feeling good after a holiday break! 
\n", "32 Bike feeling tight, needs an oil and pump " ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfg.get_group('Afternoon Ride')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usual thing to do however, is to apply aggregate functions to the groupby object:\n", "\n", "![](img/chapter8/groupby_2.png)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeDistance
Name
Afternoon Ride4654.35294112.462
Morning Ride2299.87500012.860
\n", "
" ], "text/plain": [ " Time Distance\n", "Name \n", "Afternoon Ride 4654.352941 12.462\n", "Morning Ride 2299.875000 12.860" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfg.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply multiple functions using `.aggregate()`:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeDistance
meansumcountmeansumcount
Name
Afternoon Ride4654.352941791241712.462186.9315
Morning Ride2299.875000367981612.860205.7616
\n", "
" ], "text/plain": [ " Time Distance \n", " mean sum count mean sum count\n", "Name \n", "Afternoon Ride 4654.352941 79124 17 12.462 186.93 15\n", "Morning Ride 2299.875000 36798 16 12.860 205.76 16" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfg.aggregate(['mean', 'sum', 'count'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And even apply different functions to different columns:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TimeDistance
maxminmeannum_rangesum
Name
Afternoon Ride4806217124654.35294146350186.93
Morning Ride290320902299.875000813205.76
\n", "
" ], "text/plain": [ " Time Distance\n", " max min mean num_range sum\n", "Name \n", "Afternoon Ride 48062 1712 4654.352941 46350 186.93\n", "Morning Ride 2903 2090 2299.875000 813 205.76" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def num_range(x):\n", " return x.max() - x.min()\n", "\n", "dfg.aggregate({\"Time\": ['max', 'min', 'mean', num_range], \n", " \"Distance\": ['sum']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the way, you can use aggregate for non-grouped dataframes too. This is pretty much what `df.describe` does under-the-hood:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
min1 Oct 2019, 00:15:07Afternoon RideRide1712.00000011.790000A little tired today but good weather
count33333333.00000031.00000033
meanNaNNaNNaN3512.78787912.667419NaN
num_rangeNaNNaNNaN46350.0000002.780000NaN
\n", "
" ], "text/plain": [ " Date Name Type Time \\\n", "min 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1712.000000 \n", "count 33 33 33 33.000000 \n", "mean NaN NaN NaN 3512.787879 \n", "num_range NaN NaN NaN 46350.000000 \n", "\n", " Distance Comments \n", "min 11.790000 A little tired today but good weather \n", "count 31.000000 33 \n", "mean 12.667419 NaN \n", "num_range 2.780000 NaN " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.agg(['mean', 'min', 'count', num_range])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values are typically denoted with `NaN`. We can use `df.isnull()` to find missing values in a dataframe. It returns a boolean for each element in the dataframe:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
0FalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalse
.....................
28FalseFalseFalseFalseFalseFalse
29FalseFalseFalseFalseFalseFalse
30FalseFalseFalseFalseFalseFalse
31FalseFalseFalseFalseFalseFalse
32FalseFalseFalseFalseFalseFalse
\n", "

33 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance Comments\n", "0 False False False False False False\n", "1 False False False False False False\n", "2 False False False False False False\n", "3 False False False False False False\n", "4 False False False False False False\n", ".. ... ... ... ... ... ...\n", "28 False False False False False False\n", "29 False False False False False False\n", "30 False False False False False False\n", "31 False False False False False False\n", "32 False False False False False False\n", "\n", "[33 rows x 6 columns]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it's usually more helpful to get this information by row or by column using the `.any()` or `.info()` method:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 33 entries, 0 to 32\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Date 33 non-null object \n", " 1 Name 33 non-null object \n", " 2 Type 33 non-null object \n", " 3 Time 33 non-null int64 \n", " 4 Distance 31 non-null float64\n", " 5 Comments 33 non-null object \n", "dtypes: float64(1), int64(1), object(4)\n", "memory usage: 1.7+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "\n", " Comments \n", "22 Legs feeling strong! \n", "24 A little tired today but good weather " ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.isnull().any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you have missing values, we usually either drop them or impute them.You can drop missing values with `df.dropna()`:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
.....................
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "

31 rows × 6 columns

\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", ".. ... ... ... ... ... \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", ".. ... \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump \n", "\n", "[31 rows x 6 columns]" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can impute (\"fill\") them using `.fillna()`. This method has various options for filling, you can use a fixed value, the mean of the column, the previous non-nan value, etc:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0NaN2.0NaN0
13.04.0NaN1
2NaNNaNNaN5
3NaN3.0NaN4
\n", "
" ], "text/plain": [ " A B C D\n", "0 NaN 2.0 NaN 0\n", "1 3.0 4.0 NaN 1\n", "2 NaN NaN NaN 5\n", "3 NaN 3.0 NaN 4" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[np.nan, 2, np.nan, 0],\n", " [3, 4, np.nan, 1],\n", " [np.nan, np.nan, np.nan, 5],\n", " [np.nan, 3, np.nan, 4]],\n", " columns=list('ABCD'))\n", "df" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
00.02.00.00
13.04.00.01
20.00.00.05
30.03.00.04
\n", "
" ], "text/plain": [ " A B C D\n", "0 0.0 2.0 0.0 0\n", "1 3.0 4.0 0.0 1\n", "2 0.0 0.0 0.0 5\n", "3 0.0 3.0 0.0 4" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(0) # fill with 0" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
03.02.0NaN0
13.04.0NaN1
23.03.0NaN5
33.03.0NaN4
\n", "
" ], "text/plain": [ " A B C D\n", "0 3.0 2.0 NaN 0\n", "1 3.0 4.0 NaN 1\n", "2 3.0 3.0 NaN 5\n", "3 3.0 3.0 NaN 4" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(df.mean()) # fill with the mean" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
03.02.0NaN0
13.04.0NaN1
2NaN3.0NaN5
3NaN3.0NaN4
\n", "
" ], "text/plain": [ " A B C D\n", "0 3.0 2.0 NaN 0\n", "1 3.0 4.0 NaN 1\n", "2 NaN 3.0 NaN 5\n", "3 NaN 3.0 NaN 4" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='bfill') # backward (upwards) fill from non-nan values" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0NaN2.0NaN0
13.04.0NaN1
23.04.0NaN5
33.03.0NaN4
\n", "
" ], "text/plain": [ " A B C D\n", "0 NaN 2.0 NaN 0\n", "1 3.0 4.0 NaN 1\n", "2 3.0 4.0 NaN 5\n", "3 3.0 3.0 NaN 4" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.fillna(method='ffill') # forward (downward) fill from non-nan values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, sometimes I use visualizations to help identify (patterns in) missing values. One thing I often do is print a heatmap of my dataframe to get a feel for where my missing values are. If you want to run this code, you may need to install `seaborn`:\n", "\n", "```sh\n", "conda install seaborn\n", "```" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "sns.set(rc={'figure.figsize':(7, 7)})" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0NaN2.0NaN0
13.04.0NaN1
2NaNNaNNaN5
3NaN3.0NaN4
\n", "
" ], "text/plain": [ " A B C D\n", "0 NaN 2.0 NaN 0\n", "1 3.0 4.0 NaN 1\n", "2 NaN NaN NaN 5\n", "3 NaN 3.0 NaN 4" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAAGeCAYAAADBkZVwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAAM0ElEQVR4nO3dW4iVhf7G8d/SaUyywCwFAytr7w6eQM2MCEEqDykaFJQloSAJNaZ100XRTeBNCR0IkpDIDkqMOkiBdMIkGhOCosQKc5DQkLqJLkydmX0R2z/uvZv8gzPrqfX53Pm+I+9z4/r6rvUuptHf399fABBmWLMHAMD/IlAARBIoACIJFACRBAqASAIFQKS2obxY34//HMrL8T/MGz+t2ROoql1Hvmj2hJbn30KG9/re/sNz7qAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACR2v7sBw4ePFi7du2qH3/8sYYNG1Zjx46tW265paZMmTIU+wBoUQPeQb3xxhv16KOPVlXVlClTatKkSVVV9eSTT9amTZsGfx0ALWvAO6jXXnutduzYUSNHjjzj+IoVK+rOO++slStXDuo4AFrXgHdQbW1tderUqf86fvz48TrvvPMGbRQADHgHtXr16lq6dGnddNNNdemll1aj0ahjx45Vd3d3rVu3bqg2AtCCBgzU4sWLa9asWfXpp5/WsWPHqq+vr2bOnFkdHR01bty4odoIQAv606f4xo0bV0uXLh2CKQDwf3wPCoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUCktqG82Lzx04bycgD8hbmDAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASG0DnTxy5MiAf3n8+PHndAwA/NuAgXrwwQerp6enxo4dW/39/WecazQa9cEHHwzqOABa14CBeuutt2rZsmX11FNP1YwZM4ZqEwAM/BnUqFGj6umnn64dO3YM0RwA+N2Ad1BVVVOnTq2pU6cOxRYAOM1TfABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAk
gQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkdqG8mK7jnwxlJcD4C/MHRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiDSnwbq/fffr82bN9fhw4fPOL5169ZBGwUAAwbqmWeeqddff716enrq3nvvra6urtPntmzZMujjAGhdbQOd3L17d23fvr3a2tpq+fLltXLlympvb68FCxZUf3//UG0EoAUNGKj+/v5qNBpVVXXFFVfUyy+/XCtWrKiLL7749HEAGAwDvsU3f/78Wr58eX355ZdVVfWPf/yjnnvuuVq7du1/fSYFAOfSgHdQDz/8cM2YMaMuuOCC08dmzJhR27Ztq02bNg36OABaV6N/CD9M6vvxn0N1KYABzRs/rdkTqKr3+t7+w3O+BwVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEKmt2QMYWvPGT2v2BKpq15Evmj0B4rmDAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARGr7sx/o6empkSNH1rhx4+rtt9+ub775pqZPn14LFy4cin0AtKgBA/Xqq6/W5s2bq6+vr2bPnl1Hjx6t2267rTo7O+vQoUP10EMPDdVOAFrMgIHq7Oysd999t3766adatGhRdXd314gRI+ruu++uu+66S6AAGDQDfgbV19dX7e3tddlll9XKlStrxIgRp8/19vYO+jgAWteAgbr99tvr/vvvr97e3uro6KiqqgMHDtSyZctqwYIFQzIQgNY04Ft8jzzySO3bt6+GDx9++lh7e3t1dHTUnDlzBn0cAK3rT5/iu+GGG87488SJE2vixImDNggAqnwPCoBQAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIJJAARBJoACIJFAARBIoACIJFACRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAiCRQAkQQKgEgCBUAkgQIgUqO/v7+/2SMA4D+5gwIgkkABEEmgAIgkUABEEigAIgkUAJEECoBIAgVAJIECIFJbswf8lXz77be1ePHiev7552vevHnNntNS9u7dW6tXr64JEyZUf39/nTx5su6555564IEHmj2tpfz666/17LPP1r59+2r48OF10UUX1eOPP16TJk1q9rSW8cMPP9T8+fPrqquuqqqq48eP1/Tp0+uxxx6rSy65pMnrzi2B+n/o7Oys+fPn19atWwWqCSZPnlybN2+uqt9fKO+44466+eab6+qrr27ystbQ19dXq1atqhtvvLF27NhRbW1t
1d3dXatWrap33nmnRo8e3eyJLWPs2LHV1dVVVVX9/f21YcOGWrNmTb355ptNXnZueYvvLJ08ebJ27txZa9eura+//roOHz7c7Ekt7bfffqvhw4fXhRde2OwpLWPv3r119OjRWrNmTbW1/f5/29mzZ9f69eurr6+vyetaV6PRqI6Ojvruu+/qwIEDzZ5zTgnUWdq9e3eNHz++rrzyyrr11ltr69atzZ7Ucr766qtasmRJLV68uObOnVuzZs2qsWPHNntWy9i/f39de+21NWzYmS8bc+bMqTFjxjRpFVVV7e3tdfnll9f333/f7CnnlECdpc7Ozlq0aFFVVS1cuLC2bdtWJ06caPKq1jJ58uTq6uqqnTt31ieffFI9PT21cePGZs9qGcOGDasRI0Y0ewZ/oNFo1Pnnn9/sGeeUQJ2Fn3/+ufbs2VObNm2quXPn1hNPPFG//PJLvffee82e1rJGjRpVCxYsqM8//7zZU1rG5MmTa//+/fWfv6Fnw4YN1d3d3aRVVFWdOHGiDh069Lf7PFagzkJXV1fNnj27Pv744/rwww/ro48+qtWrV9eWLVuaPa1l9fb21meffVbXX399s6e0jJkzZ9aYMWPqxRdfrN7e3qqq2rNnT23btu1v98L4V9LX11cvvPBCTZs2rSZMmNDsOeeUp/jOwvbt22vdunVnHLvvvvvqlVdeqYMHD55+3JPB9e/PoBqNRp06daquueaaWrVqVbNntYxGo1EvvfRSrV+/vhYtWlRtbW01evTo2rhx49/u8eZ0x44dqyVLllTV74G67rrrasOGDU1ede75jboARPIWHwCRBAqASAIFQCSBAiCSQAEQSaAAiCRQAEQSKAAi/QvXS9vP2cvFXgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.heatmap(df.isnull(), cmap='viridis', cbar=False);" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAaoAAAGgCAYAAAD7FCOQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAA14UlEQVR4nO3df1xUdb4/8BfyK4hcrwniVwlbNxavP7AolVDRkhEYxiExBLwgElcoA9Fupiwrrg9F8bqRrj32EQ/RLZEECSVpxQhdr4ir6S3YslzzV3olAilxVmRgON8/vM5VFx04c8544LyefzUH/Mxrfhzenc/nnPexEwRBABERkUL1e9gBiIiIHoSFioiIFI2FioiIFI2FioiIFI2FioiIFI2FioiIFM2qQrV3716EhYVBo9Fgx44dUmUiIiIycxD7DxsaGpCbm4vS0lI4OTkhOjoaEyZMwK9+9Ssp8xERkcqJLlQ1NTWYOHEiBgwYAACYMWMGKioq8Prrr1v8t8H9Xhb7tERE1EdVdu7qcrvoqb8ff/wR7u7u5sceHh5oaGgQOxwREVGXRBeqzs5O2NnZmR8LgnDXYyIiIimInvrz9PTEiRMnzI8bGxvh4eEhSSjqffZfqZVsrBn/z0+ysYio9xN9RPX888/j6NGjaG5uRmtrKz799FNMmTJFymxERETij6gGDx6MxYsXIz4+Hu3t7Zg9ezbGjh0rZTYiIiLxhQoAdDoddDqdVFmIiIj+iVWFimxLyetAXFei3krK/QrgviAHtlAiIiJFY6EiIiJFs6pQbdy4EWFhYdBqtdi2bZtUmYiIiMxEr1EdP34cf/3rX/Hxxx+jo6MDYWFhCAoKwi9/+Usp89EdpJz75rw82ZKSv29q+u4q+XN4ENFHVOPHj8cHH3wABwcHXL16FSaTCa6urlJmIyIism7qz9HREZs2bYJWq0VAQAAGDx4sVS4iIiIAEpxMkZaWhqNHj6K+vh7FxcVSZCIiIjITvUZ19uxZGI1GjBw5Ei4uLtBoNDh9+rSU2Yioj1DTOpCS9dbPQfQR1eXLl5GZmQmj0Qij0Yiqqir4+/tLmY2IiEj8EVVQUBDq6uoQEREBe3t7aDQaaLVaKbMRERHBThAEwdZPyjv8Um8l9em9Uuqt0zpEt0l+h18iIiJbYKEiIiJFs6pQHThwALNmzUJoaChWr14tVSYiIiIz0SdTXLp0CVlZWdi1axcef/xxzJs3D4cOHUJQUJCU+YgURcnrQEpuj8O1vb7JVp+r6EJVWVmJsLAweHp6AgByc3Ph7OwsWTAiIiLAiqm/ixcvwmQyISUlBXq9HoWFhfjFL34hZTYiIiLxhcpkMuHo0aPIzs5GUVER6urqsHv3bimzERERiZ/6GzRoEAICAjBw4EAAwPTp01FXV4dZs2ZJFo7ko+T1DOp7+P1QBqXv95WdXW8XfUQ1bdo0VFdXo6WlBSaTCYcPH8aoUaPEDkdERNQl0UdUfn5+SEpKQmxsLNrb2xEYGIjIyEgpsxEREYkvVAAwe/ZszJ49W6osRERE/4S9/u4h5Rwu5+WJblH62ggpA3v9ERFRr2RVocrLy8OMGTOg0+nwxz/+UapMREREZqILVU1NDfbu3YuPPvoIe/bsQW1tLT799FMpsxEREYk/meLUqVOYNGkS3NzcAACTJ0/GZ599Bo1GI1m4h4H9zciW1LImKnU2tbxvdIvoI6pRo0ahuroaP//8M9ra2nDgwAE0NTVJmY2IiEj8EVVAQABmzZqFuLg4DBgw
AAEBAaitVe4RBBER9U6iT083GAy4du0ahg4dCgDYsmULfvjhB2RmZlr8t1Kens7TXomI+gbJT0+/fPkyXnvtNXR0dOD69esoKSlBaGio6IBERERdET315+vrC41Gg5kzZ8JkMiEhIQH+/v5SZiMiIur9nSk49UdE1Dfcb+rPql5/SsDCQkTUt7GFEhERKRoLFRERKVq3CpXBYEB4eDguX74MACgqKkJ4eDh0Oh2WL18Oo9Eoa0giIlIvi2tUtbW1yMzMxIULFwAA58+fR35+PkpLS/Hoo49i2bJlKCwsREJCgsxRbYOtWai34olFfQ8/01ssHlEVFxcjKysLHh4eAAAnJydkZWXBzc0NdnZ28PHxwZUrV2QPSkRE6mTxiGrNmjV3PR46dKi5G0VzczN27NiBtWvXypOOiIhUT/TJFA0NDZg3bx4iIyMxYcIEKTMRERGZibqO6uzZs0hKSkJcXBwSExOlzvRQ9dY53J5S8tw3sz38sdRETZ+pkl/rg/S4UBkMBrzyyitIT09HRESEDJGIiIj+T4+n/kpKStDU1IRt27ZBr9dDr9dj48aNcmQjIiLq/hHVgQMHAAAJCQl95lR0IiJSvl7f64/EUfJ6hpqy8bo9IsvYQomIiBRNVAul5cuXQ6PRmNeoKisrZQ1JRETq1eMWSgDw1VdfoaCgwNytgkjJU1hKzsbpuodPTZ9Bb32tPW6h1NraiitXriAjIwM6nQ6bNm1CZ2en7EGJiEidLBaqNWvW4NlnnzU/bmpqwsSJE5GdnY3i4mKcOHECJSUlsoYkIiL16vHJFF5eXnj33Xfh4eEBFxcXxMXF4dChQ3JkIyIi6vnp6adPn8aFCxcwY8YMAIAgCHBweHhnuffWliB9jZLfNyVnkxL3BeqrenxEJQgCsrOzce3aNbS3t6OoqAjBwcFyZCMiIur5EZWvry8WLFiAmJgYdHR0QKPRIDw8XI5sREREPW+hBABz587F3LlzZQlERER0p17fQonz6H2PktdalJyN+wL1VWyhREREisZCRUREiiaq1191dTVmzpyJ8PBwLF26FEajUdaQRESkXqJ6/f3mN7/B1q1bMWLECKSlpaGsrAwvv/yynDkJyu5ZJyU19fojIst63OsPAEwmEwwGA0wmE9ra2uDs7CxrSCIiUi+LR1Rr1qz5p20rV65EXFwc3NzcMGzYMISEhMgSjoiIqMcnUzQ2NmLDhg0oLy9HdXU1/Pz8sHbtWjmyERER9fw6qhMnTsDHxwdPPPEEACAqKgrp6elS56IucH1EHL5v9CBKvjZOar31tfb4iMrHxwd1dXVoamoCAFRVVWHMmDGSByMiIgJEHFGNGDECixYtQnx8POzt7eHt7Y1Vq1bJkY2IiAh2giAItn7S4H7qOJVd6sNsKSl5eoKIblH6VJ3U+fp5/r3r7ZI+CxERkcRYqIiISNG6Vag2b94MrVYLrVaL9evXm7e3t7dj3rx5OHbsmGwBiYhI3SyeTFFTU4Pq6mrs3r0bdnZ2SEpKQmVlJUaMGIGMjAycOnXKFjnvS8lzuFwHIqK+TOq/cZWdXW+3eETl7u6OZcuWwcnJCY6OjhgxYgSuXLmCkpISJCUlwc+Pf4yJiEg+Fo+onnrqKfN/X7hwAfv27cOHH36I4cOHAwDef/992cIRERF1+2SKM2fOIDExEUuXLjUXKSIiIrl164LfkydPIi0tDRkZGdBqtXJn6hGuAxFRX8W/b7dYLFT19fVYuHAhcnNzERAQYItMREREZhYLVX5+Ptra2rBu3TrztujoaMTExMgajIiICGALJSIiUojKzl1dbu9xU1rqG5R8/RmRJVJ+f/ndVT62UCIiIkVjoSIiIkWzqtcfABQUFCAuLk6WcERERKJ7/QUHB+O7775DXl4evL29bZGVJMR5ebJEyetA/P6qi+hef0ajEStWrEBaWpotchIRkUqJ7vX3+9//HpGRkRg2bJisAYmISN26fXr6mTNnkJycjKVLl+J//ud/
UF9fj+XLl/NeVEQKoeRLDpScTWpSv1bq5skUJ0+eREJCAt544w289NJLKC8vx5kzZ6DX65GZmYmvvvoK6enpMkclIiI1EtXrb+3ateafHzt2DJs3b8Y777wjW0giIlIv9vojIiJFY68/IpKdmtao1ETqz7Wf59+73i7psxAREUmMhYqIiBStW6enb968Gfv27QMABAUFYcKECXj77bfNP29oaICfnx/ee+89eVISEZFqiWqh9PTTT6OsrAwA0NjYiJiYGCxfvlz2sER0f0peB+KaUt8k9eda2dn1douF6s4WSgDMLZRuW79+PaKjozF8+HBJghIREd3J4hrVU089hXHjxgH4vxZKQUFB5sfHjx9HfHy8rCGJiEi9un0yxZkzZ5CYmIilS5eaj56KiooQGxtrPtoiIiKSWrdOpjh58iTS0tKQkZEBrVZr3l5VVYX8/HzZwhFR93EdiPoqUS2UAKC5uRk3b96El5eXrAGJiEjdRLdQGjVqFDw9PWUNR0RExBZKRESkCJWdu7rc3u37USmVku/9oqY1AyXftpyIeje2UCIiIkXrVqHauHEjwsLCoNVqsW3bNgC3OlbodDpoNBrk5ubKGpKIiNTL4tTf8ePH8de//hUff/wxOjo6EBYWhoCAAGRkZGD79u0YMmQIkpOTcejQIfOFwLbEaSJl4OdARHKxeEQ1fvx4fPDBB3BwcMDVq1dhMpnQ0tICb29veHl5wcHBATqdDhUVFbbIS0REKtOtqT9HR0ds2rQJWq0WAQEB+PHHH+Hu7m7+uYeHBxoaGmQLSURE6tXtkynS0tJw9OhR1NfX48KFC7CzszP/TBCEux4TERFJxeIa1dmzZ2E0GjFy5Ei4uLhAo9GgoqIC9vb25t9pbGyEh4eHrEF7IyXfdoHEUdNnyksOSCksHlFdvnwZmZmZMBqNMBqNqKqqQnR0NM6fP4+LFy/CZDKhvLwcU6ZMsUVeIiJSGYtHVEFBQairq0NERATs7e2h0Wig1WoxcOBApKamoq2tDUFBQQgJCbFFXiIiUpludaZITU1FamrqXdsCAgLw8ccfyxKKiIjotl7f609NawZED6LkfUHJ2Ug8qT/Xfp5/73q7pM9CREQkMRYqIiJSNNG9/goLC6HVahEWFoacnBw8hBlEIiJSAVG9/oKCgvCnP/0Je/bsgbOzM+bOnYsjR45g0qRJtsh8F85VE93CfYH6KlG9/lxdXfHJJ5/A1dUVLS0tMBgM6N+/vy3yEhGRyojq9Td48GA4OjqiuLgY06dPh7u7O3x9feXOSkREKiSq119xcTEAICoqCseOHcOgQYOwefNm2UISEZF6ier1V1tbi1/96lfw9/eHg4MDtFotPvzwQ1vkJYVScl84JWdTCyW/b7zGSzypX2tlZ9fbRfX6GzZsGN588020tLRAEATs378f/v7+kgYmIiICRPb6e+211zBw4EBER0fD3t4ezz77LObPn2+LvEREpDK9voUSEd3CKSzq7So7d3W5nZ0piIhI0VioiIhI0US3UPriiy8QFRUFrVaLJUuWwGg0yhqUiIjUSVQLpYCAAKSmpmLLli3w9fXFkiVLUFJSgtjYWFtklhVPZX74uNYijlpeJ8D9VG0sFqo7Wyg1NDTAZDLhm2++wbhx48zdKDIzM2EymWQPS0RE6iOqhVJjYyNcXV2xePFi6PV6/OEPf2CvPyIikoWoFkpGoxHV1dVYsmQJSktL0drairy8PDlzEhGRSolqoZSTk4PAwEB4eXkBAEJDQ1FQUCB7WFtQy3w114GoN1PL94376S2iWiitWrUKX3/9Nerr6wEABw8exKhRo2QPS0RE6iOqhVJERAQGDBiAlJQUtLW1YeTIkXjrrbdskZeIiFTGYqECgNTUVKSmpt61berUqZg6daocmYiIiMweSq+/zh98JBurt865EhHR3djrj4iIeiUWKiIiUjTRvf5KS0sRFhYGnU6H1atXo6OjQ9agRESkTqJ6/QUFBeGdd95B
SUkJPDw8sHLlSmzfvr3bN0/kuhIREXWXxSOqO3v9Xb16FSaTCXV1dRg3bhw8PDwAANOmTcNnn30me1giIlIfUb3+xo4di9raWtTX18NkMqGiogJNTU1yZyUiIhXq0enpra2tSElJQVhYGFxcXLB161Y88sgjCAkJQUlJCcrLy7s1jlpuRc/2J0S9A28bogyiT08/e/YsvvnmGwAw9/qrq6vD2LFjsWfPHuzcuRODBw829/0jIupNpP4fSpKeqF5/EyZMQEJCAgwGA4xGIwoKChAWFmaLvEREpDKiev3NnDkTbW1tmDNnDjo6OhAeHg6dTmeLvEREpDIPpYWSWtaoiEgeXFMSR+nr5myhREREvRILFRERKVq3C1VOTg6WLVsGACgqKjKvSy1fvhxGo1G2gEREpG7duh/V0aNHsXv3bkydOhXnz59Hfn4+SktL8eijj2LZsmUoLCxEQkKCzFG7pvQ5VyKSHvdTcXrr+2bxiOrnn39Gbm4uUlJSAABOTk7IysqCm5sb7Ozs4OPjgytXrsgelIiI1MlioVqxYgUWL16M/v37AwCGDh2KwMBAAEBzczN27NiBF198Ud6URESkWg8sVLt27cKQIUMQEBDwTz9raGjAvHnzEBkZiQkTJsgWkIiI1O2B11HNnz8fjY2NsLe3x7Vr13Djxg1ERERgzpw5SEpKQlxcHBITE3v8pLyOioiI7nW/66geeDLF7ZskArdulHj8+HGkpaUhPDwc6enpiIiIkDQkERHRvXp8HVVJSQmampqwbds26PV66PV6bNy4UY5sREREbKFERETKIGrqj4hIidTS60/ptyCx1XvHFkpERKRo3T6iysnJwU8//YR169Zh+fLlOHnyJFxcXAAAr7/+OoKDg2ULSURE6tXjFkoA8NVXX6GgoAAeHh5yZiMiIrJcqO5sofTtt9+itbUVV65cQUZGBhoaGhAcHIzXX38d/fo9nFlE9vojS7ieIY6SX6uSsymZ1O+brdbQetxCqampCRMnTkR2djaKi4tx4sQJlJSUyB6UiIjUqcctlLy8vPDuu+/Cw8MDLi4uiIuLw6FDh2QPSkRE6vTAqb8///nPaGxshF6vN7dQWrhwIWbOnIkZM2YAAARBgIPDwzvLnVMAZAm/I9RbKf27K/2t6Lve3uMWSgkJCUhOTsbEiRPh6uqKoqIivPTSS5KGJSIiuq3Hh0K+vr5YsGABYmJi0NHRAY1Gg/DwcDmyERERsYUSUV+hprP+qG9iC6WHgH84iOShlksO6Ba2UCIiIkVjoSIiIkXrdqHKycnBsmXLAADV1dWYOXMmwsPDsXTpUhiNRtkCEhGRuonq9feb3/wGW7duxYgRI5CWloaysjK8/DJPkLgX577JEq61iKOm10rdOKK6s9ffbSaTCQaDASaTCW1tbXB2dpY1JBERqZfFI6rbvf7q6+vN21auXIm4uDi4ublh2LBhCAkJkTUkERGpV497/TU2NmLDhg0oLy9HdXU1/Pz8sHbtWtmDEhGROj3wgt/58+ejsbER9vb25l5/bm5uGD16NDZt2gQAOHPmDNLT0/HJJ590+0k7f/CxPvn/UnLbes6jKwM/U6LeQdQFv131+vv3f/93vPLKK2hqasKgQYNQVVWFMWPGSJuWiIjof/W4M8WIESOwaNEixMfHw97eHt7e3li1apUc2YiIiB5Orz9O/ZEt8TMl6h0U1etPyTu7krNJiX0IichairkVPRER0cPUrUIVFxcHrVYLvV4PvV6P2tpbVbS9vR3z5s3DsWPHZA1JRETqZXHqTxAEXLhwAQcPHrzrlvPnzp1DRkYGTp06JWtAkoeSp+o4LUnUO9jqVvQWj6jOnTsHAEhMTMTMmTNRUFAAACgpKUFSUhL8/PhHgIiI5GPxiKqlpQUBAQH47W9/i/b2dsTHx+PJJ5/E0qVLAQDvv/++7CGJiEi9LBaqp59+Gk8//bT58ezZs3Ho0CEEBgbKGoyIiAjoRqE6ceIE2tvbzf3+
BEG4a62qr7HV6ZZiqGWtRS2vk8TjtXHqYnGN6vr161i/fj3a2tpgMBiwe/duBAcH2yIbERGR5SOqadOmoba2FhEREejs7ERsbOxdU4FERERyeigtlIL7KfduwJz6I1I+Tv31TYpqoaRk/NISKR/3U3VhCyUiIlI0FioiIlK0bk39xcXFobm52Xxa+qpVq8wdKQoKCrB//35s375dvpRERKRaonv9AcB3332HvLw8eHt7yxaQ1EfJJ7RITS1rLezfSNYQ3evPaDRixYoVSEtLkzchERGpmuhef//1X/+FyMhIDBs2zBY5iYhIpUT1+vvP//xPPPHEE1i+fDnvRUVERLIS1etv5MiR+PLLL6HX63Hjxg00NTUhPT0d77zzjtx5VU3JazdSrhlIvf7Ai0OJejdRvf6ioqKwb98+lJWVYfXq1Rg9ejSLFBERyYK9/oiISNHY64+IuqTkKVO1TIMrndSfQz/Pv3e9XdJnISIikhgLFRERKZqoFkrBwcGorKw0/7yhoQF+fn5477335ElJRESqZXGNShAETJkypcsWSgDQ2NiImJgYbNmyBcOHD+/Wk3KN6uFjSxuyJX7fqDvudz8q0S2Ublu/fj2io6O7XaSIiIh6QnQLpcDAQFy4cAHHjx/HmjVrbJGViIhUSFQLpUOHDiEwMBBFRUWIjY2Fk5OTrCGJiEi9RLVQur1WVVVVhfz8fHkTkpmSrx0hehCuKZE1RLVQCg4ORnNzM27evAkvLy9b5CQiIpUS3UKprq4Onp6etshIREQqxhZKvYiSp/44tUNE1rrf6enduuCXlEEtxUBN19wouZ8ePXxq2hcehC2UiIhI0VioiIhI0bpVqA4cOIBZs2YhNDQUq1evBgDU1NRAp9NBo9EgNzdX1pBERKReFk+muHTpEmJjY7Fr1y48/vjjmDdvHpKTk5GVlYXt27djyJAhSE5ORnx8PIKCgrr1pFKeTME5XCJ5cP2MLFHM/agqKysRFhYGT09PODo6Ijc3Fy4uLvD29oaXlxccHByg0+lQUVEhaWAiIiKgG2f9Xbx4EY6OjkhJSUF9fT2mTp2Kp556Cu7u7ubf8fDwQENDg6xBiYhInSwWKpPJhBMnTmD79u1wdXXFq6++ikceeQR2dnbm3xEE4a7HRGR7Sr7Ojvomqad0Kzu73m6xUA0aNAgBAQEYOHAgAGD69OmoqKiAvb29+XcaGxvh4eEhTVIiIqI7WFyjmjZtGqqrq9HS0gKTyYTDhw8jJCQE58+fx8WLF2EymVBeXo4pU6bYIi8REamMxSMqPz8/JCUlITY2Fu3t7QgMDERMTAx++ctfIjU1FW1tbQgKCkJISIgt8hIRkcqw1x8pjpouOeAp4PQgSl93lH6NSuSt6ImIiB4mFioiIlI00S2UCgsLodVqERYWhpycHDyEGUQiIlIBUS2UwsPD8ac//Ql79uyBs7Mz5s6di9dffx2TJk3q1pNyjYpI+bh+Jo6a3jdbtVCyeNbfnS2UACA3NxfOzs54+eWX4ejoiJ9++gkGgwH9+/eXNDARERHQjam/29dKpaSkQK/Xo7CwEL/4xS/g6OiI4uJiTJ8+He7u7vD19bVFXiIiUhmLhcpkMuHo0aPIzs5GUVER6urqsHv3bgBAVFQUjh07hkGDBmHz5s2yhyUiIvUR1ULp888/h7e3N/z9/eHg4ACtVosPP/xQ9rC9jZquB1IyNa0ZSElNr1VKUr5vSv8bYqtef6JaKPn6+uLNN99ES0sLBEHA/v374e/vL2lgIiIiQGQLpbi4ODg7OyM6Ohr29vZ49tlnMX/+fFvkJSIilbFYqABg9uzZmD179l3boqOjER0dLUsoIiKi27pVqEgczvFTb8a1PXH4vkmPLZSIiEjRRLdQ+uKLLxAVFQWtVoslS5bAaDTKGpSIiNTJ4tTfpUuXkJWVdVcLpc8++wwrV67Eli1b4OvriyVLlqCkpASxsbG2yCwrHrb3PWr5HJR+KjP13P4rtfwcILKF
0pdffolx48aZu1FkZmbCZDLJm5SISGVYpG6xWKguXrwIR0dHpKSkoL6+HlOnTsWjjz4KV1dXLF68GOfOncMzzzyDZcuW2SIvERGpjKgWSiaTCdXV1ViyZAlKS0vR2tqKvLw8W+QlIiKVEdVCKScnB4GBgfDy8gIAhIaGoqCgQN6k98F5eaJbpP7ucr1WHDW9VlsR1UJpwYIF+Prrr1FfXw8AOHjwIEaNGiV7WCIiUh9RLZRee+01jB49GikpKWhra8PIkSPx1ltv2SIvERGpjOgWSlOnTsXUqVPlyERERGTW61socT6YSB7ct0gp2EKJiIgUjYWKiIgUTXSvv9LSUoSFhUGn02H16tXo6OiQNSgREamTnSAIwoN+4dKlS4iNjb2r19+MGTOQn5+PkpISeHh4YOXKlfD29u72zRM7f/CRJLwclDwvz+taiKgvq+zc1eV2i0dUd/b6c3R0RG5uLtzd3TFu3Dh4eHgAuHWt1WeffSZtYiIiInSjUF28eBEmkwkpKSnQ6/UoLCyEr68vamtrUV9fD5PJhIqKCjQ1NdkiLxERqYzF09NNJhNOnDiB7du3w9XVFa+++iq8vb3xxhtv4NVXX8UjjzyCkJAQ/O1vf7NFXiIiUhlRvf4+//xzJCcnY8+ePQCAffv2mfv+dQfXR8Th+0YkPfYLVT5Rvf6eeuopJCQkwGAwwGg0oqCgAGFhYbbIS0REKiOq119CQgIee+wxzJkzBx0dHQgPD4dOp7NFXiIiUhmLp6fLIbjfy7Z+SiLqQ3ipRt8k+vR0IiKih4mFioiIFM3iGtWuXbvuunvv5cuXodfr8etf/xrbt2+HnZ0dRo8ejd/97ndwcnKSNSwREalPj9aozpw5g4ULF2LDhg34j//4D5SWluLRRx/FsmXLMHLkSCQkJHRrHK5RkS0peT1Dydno4ZP61HmpSf2du98aVY/uR7Vy5UosXrwYjz/+OLKysuDm5gYA8PHxwZUrV6xPSUREdI9uF6qamhrcvHkToaGhAIChQ4cCAJqbm7Fjxw6sXbtWnoRERKRq3T6ZYufOnf/UHb2hoQHz5s1DZGQkJkyYIHk4IiKibq1RGY1GBAUFoaqqCq6urgCAs2fPIikpCXFxcUhMTOzRk3KNioiI7mXVGtXp06cxfPhwc5EyGAx45ZVXkJ6ejoiICMlCEhER3atbU3+XLl2Cp6en+XFJSQmampqwbds26PV66PV6bNy4UbaQRESkXmyhREREiiDJ6en0cPGaGyJSI7ZQIiIiRWOhIiIiRRPd66+1tRUnT56Ei4sLAOD1119HcHCwfEmJiEiVRPX627lzJ+bNm4f8/Hx4eHj0+Ek7f/Dp8b+5HzWttXCNioj6MknuR3W715+LiwuuXLmCjIwM6HQ6bNq0CZ2dnZIEJSIiulO3C9Wdvf6ampowceJEZGdno7i4GCdOnEBJSYmcOYmISKW6PfWXlpYGjUaD8PDwf/pZZWUl9uzZg3fffbdbTyrldVRKb4MvJSmn66R+3ziVSLbE72/fZNXUn9FoxOeff44XXngBwK2WSvv37zf/XBAEODjwkiwi6n1YpJSvW4Xq3l5/giAgOzsb165dQ3t7O4qKinjGHxERyaJbh0H39vrz9fXFggULEBMTg46OjvtOCRIREVmLvf56ESWvx3H6hGyJ+0LfJMnp6URERLbGQkVERIomuoXSCy+8gPXr16OzsxP/+q//itWrV8PJyUnWsEREpD6iWyi99NJL2Lp1K0aMGIG0tDRMnjwZL7/cvbUnrlEREdG9JG2hNHDgQJhMJhgMBphMJrS1tcHZ2VmSoERERHfq9lW6d7ZQAm4Vrbi4OLi5uWHYsGEICQmRLSQREalXt4+odu7cifnz5wMAGhsbsWHDBpSXl6O6uhp+fn5Yu3atbCGJiEi9unVEdbuF0rp16wAAJ06cgI+PD5544gkAQFRUFNLT02ULSdRX8dYtRJaJaqHk4+ODuro6NDU1AQCqqqowZswY+VIS
EZFqiWqhNGLECCxatAjx8fGwt7eHt7c3Vq1aJVtIIiJSL7ZQInqIOPVH9H/ud3o6781B9BCxuBBZxhZKRESkaN0qVGVlZdBqtdBqtcjJyTFvb29vx7x583Ds2DHZAhIRkbpZnPprbW3FmjVrUFFRgf79+yMmJgY1NTXw9PRERkYGTp06ZYucRESkUhYLlclkQmdnJ1pbW+Hq6oqOjg44OzujpKQESUlJeP/9922Rk1RE6nsNcR2IqHezWKjc3NywaNEihIaGwsXFBc899xyeeeYZ+Pv7AwALFRERycriGtW3336Ljz76CAcPHsThw4fRr18/5Ofn2yIbERGR5SOq6upqBAQE4PHHHwcAzJo1C4WFhUhKSpI9HKkTp+qI6E4Wj6h8fX1RU1ODGzduQBAEHDhwgO2SiIjIZiweUU2aNAmnTp3CrFmz4OjoiDFjxmDBggW2yEZERMQWSkREpAxsodQHsC+cOFKf7i4lNX0ORGKxhRIRESkaCxURESmaVb3+AKCgoABxcXGyhCMiIhLd6+/555/Hd999h7y8PHh7e9siq+pJuZ6hpjZFSs5GRJZZPKK6s9dfR0eHudef0WjEihUrkJaWZoucRESkUqJ7/a1btw6RkZEYNmyYLXISEZFKie71V19fj8jISFtkJCIiFRPV6y8/Px9NTU3Q6/W4ceMGmpqakJ6ejnfeeUfuvCQRJa/b8LoncZS87sjPVBwlf6a2JKrX3/jx47Fv3z6UlZVh9erVGD16NIsUERHJgr3+iIhI0brVQmnBggX3LU4TJkzAhAkTJA1FRER0G5vSEhGRItyvKS1bKBERkaJ1a+qvrKwMeXl5AIApU6Zg4sSJePvtt80/b2hogJ+fH9577z15UhIRkWqJaqE0efJklJWVAQAaGxsRExOD5cuXyx5W7dRymw8ln5Kr5GxEtmaryw5Et1C6bf369YiOjsbw4cPlzElERCplsVDd2UIpKCgIQ4cOxTPPPAMAuHDhAo4fP474+HjZgxIRkTqJbqEEAEVFRYiNjYWTk5PsQYmISJ1EtVAqLCxEUlISqqqqzEWL5Kfk9Qwlt8jh2p44anmtSn6dSif1e1fZ2fV2US2UxowZg+bmZty8eRNeXl6SBiUiIrqT6BZKp0+fhqenpy0yEhGRioluoTR27FgUFxfLEoqIiOi2bhUqUgYlz8sreZ5fyetnUlLyZyA1Nb1WYgslIiJSOBYqIiJStG4Vqry8PMyYMQM6nQ5//OMfAQA1NTXQ6XTQaDTIzc2VNSQREamXxTWqmpoa7N27Fx999BFcXFywcOFCfPzxx3j77bexfft2DBkyBMnJyTh06BCCgoJskfkuSr52RMnZlIzvmzh836ivsnhEderUKUyaNAlubm6wt7fH5MmTsWvXLnh7e8PLywsODg7Q6XSoqKiwRV4iIlIZi4Vq1KhRqK6uxs8//4y2tjYcOHAA//3f/w13d3fz73h4eKChoUHWoEREpE4Wp/4CAgIwa9YsxMXFYcCAAQgICEB1dTXs7OzMvyMIwl2PiYiIpGKxUBkMBmg0GsyfPx8AsGXLFowfPx6NjY3m32lsbISHh4d8KR9AyfPoSs6mZHzfxOH7RrammPtRXb58Ga+99ho6Ojpw/fp1lJSUID09HefPn8fFixdhMplQXl6OKVOm2CIvERGpjMUjKl9fX2g0GsycORMmkwkJCQnw9/fHunXrkJqaira2NgQFBSEkJMQWeYmISGXsBEEQbP2kwf1etvVTEtFDpOQ2VlJPmSq51ZnSVXbu6nI7O1MQEZGisVAREZGiiW6hVFhYCK1Wi7CwMOTk5OAhzCASEZEKiGqhtHXrVuzcuRN79uyBs7Mz5s6diyNHjmDSpEm2yNxrsKUNkTyUvC8oOZvUFHN6elctlL799lt88skncHV1RUtLCwwGA/r372+LvEREpDKiWig1NTXB0dERxcXFmD59Otzd3eHr62uLvEREpDIWC9WdLZSSkpLg7+8PR0dHAEBU
VBSOHTuGQYMGYfPmzbKHJSIi9bF4HZXBYMC1a9cwdOhQALdaKJ07dw6RkZHw9/cHAPzlL3/Bhx9+iPfee69bT8rrqOhBuLZHlvBapb5J9HVUXbVQioiIwJtvvomWlhYIgoD9+/ebixYREZGURLVQGj9+PBYsWIDo6GjY29vj2WefNTetJSIikhJbKJHicOqPLOHUX990v6k/i0dURLbGPxzisMBTX8UWSkREpGiiWyh98cUXiIqKglarxZIlS2A0GmUNSkRE6mSxUN3ZQmnPnj2ora3Fnj17kJqailWrVuGTTz4BAJSUlMgeloiI1MfiGtWdLZQAYPLkyfjd736HwMBAczeKzMxMmEwmeZOSanCtRRy1vE5A2tfK75vyiWqhdOPGDbi6umLx4sXQ6/X4wx/+wF5/REQkC1EtlACguroaS5YsQWlpKVpbW5GXlyd7WCIiUh+LU38GgwEajcZ8Qe+WLVvg6uoKPz8/eHl5AQBCQ0NRUFAgb9L74GF738PPgCzhdVTqIqqF0nvvvYevv/4a9fX1AICDBw9i1KhRsoclIiL1Ed1CadWqVUhJSUFbWxtGjhyJt956yxZ5iYhIZXp9CyVO/RGpD6f++qY+20KJXzKiW9T0P21KzsYiKj22UCIiIkVjoSIiIkUT3euvtLQUYWFh0Ol0WL16NTo6OmQNSkRE6mRxjerOXn8uLi5YuHAh8vLyUFBQgJKSEnh4eGDlypXYvn17n7h5IueXxeH79vDxfaO+yuIR1Z29/uzt7TF58mRs3rwZ48aNg4eHBwBg2rRp+Oyzz2QPS0RE6iOq19+4ceNQW1uL+vp6mEwmVFRUoKmpyRZ5iYhIZSxO/d3Z62/AgAEICAhAbW0t3njjDbz66qt45JFHEBISgr/97W+2yEtERCpj8YJfg8GAa9euYejQoQBu9fr7/vvvkZiYiOHDhwMA9u3bh48//th8ogUREZFURPX60+v1SEhIgMFggNFoREFBAcLCwmyRl4iIVEZUrz9/f38sXLgQc+bMQUdHB8LDw6HT6WyRl4iIVOah9PojIiLqLnamICIiRWOhIiIiRWOhIiIiRWOhIiIiRWOhIiIiRWOhIiIiRVNkodq7dy/CwsKg0WiwY8cOq8czGAwIDw/H5cuXrR5r8+bN0Gq10Gq1WL9+vdXjbdy4EWFhYdBqtdi2bZvV4wFATk4Oli1bZvU4cXFx0Gq10Ov10Ov1qK0V3yH9wIEDmDVrFkJDQ7F69Wqrcu3atcucSa/Xw9/fH6tWrRI9XllZmfkzzcnJsSob0PVtcXrq3u9sTU0NdDodNBoNcnNzrR4PAJYuXYrS0lKrxyoqKjJfS7l8+XIYjUarxissLIRWq0VYWBhycnLQkyto7revFxQUIC4urke5uhpv+fLl0Gg05u9eZWWl6LG++OILREVFQavVYsmSJVa9b4cOHbprn5g4cSKSk5NFZ6uursbMmTMRHh6OpUuXWv2ZWn1bKEFhfvjhB2HatGnCTz/9JPzjH/8QdDqdcObMGdHjffnll0J4eLgwatQo4dKlS1ZlO3LkiDBnzhyhra1NMBqNQnx8vPDpp5+KHu/YsWNCdHS00N7eLrS2tgrTpk0Tzp49a1XGmpoaYcKECcJbb71l1TidnZ3CpEmThPb2dqvGEQRB+P7774VJkyYJ9fX1gtFoFGJiYoS//OUvVo8rCILw97//XQgODhauXr0q6t/fuHFDeO6554SrV68K7e3twuzZs4UjR46IznPkyBEhPDxcuH79utDR0SEkJycL+/fv79EY935nW1tbhaCgIOH7778X2tvbhcTExB69f/eO98MPPwjJycnC2LFjhY8++siqbOfOnROCg4OF69evC52dncLSpUuFbdu2iR7v+++/F4KDg4V//OMfQkdHhzBnzhzh8OHDosa67cyZM8LkyZOFf/u3f7PqtQqCIISHhwsNDQ09Gqersa5fvy4EBgYK33zzjSAIgrB48WJhx44dVmW77ccffxRe
fPFF4fz586LHmjJlivDdd98JgiAIqampQnFxsehsZ8+eFSZPnmx+37KysoStW7d2ezxBEATFHVHV1NRg4sSJGDBgAFxdXTFjxgxUVFSIHq+4uBhZWVnmW5JYw93dHcuWLYOTkxMcHR0xYsQIXLlyRfR448ePxwcffAAHBwdcvXoVJpMJrq6uosf7+eefkZubi5SUFNFj3Hbu3DkAQGJiImbOnImCggLRY1VWViIsLAyenp5wdHREbm4u/PykuXfSypUrsXjxYgwcOFDUvzeZTOjs7ERrays6OjrQ0dEBZ2dn0Xm6ui1OT2+Bc+93tq6uDt7e3vDy8oKDgwN0Ol2P9ol7x9u7dy9efPFFhIaG9ihXV2M5OTkhKysLbm5usLOzg4+PT4/2iXvH8/LywieffAJXV1e0tLTAYDCgf//+osYCAKPRiBUrViAtLa0Hr7Lr8VpbW3HlyhVkZGRAp9Nh06ZN6OzsFDXWkSNHMG7cOPj6+gIAMjMzERwcLDrbndavX4/o6GhzL1YxY5lMJhgMBphMJrS1tfVon7h3vNOnT1t9WyiLLZRs7ccff4S7u7v5sYeHB+rq6kSPt2bNGiliAQCeeuop839fuHAB+/btw4cffmjVmI6Ojti0aRO2bt2KkJAQDB48WPRYK1aswOLFi1FfX29VJgBoaWlBQEAAfvvb36K9vR3x8fF48sknERgY2OOxLl68CEdHR6SkpKC+vh5Tp05Fenq61Rlrampw8+ZNUX9wb3Nzc8OiRYsQGhoKFxcXPPfcc3jmmWdEjzdq1ChkZ2cjOTkZLi4uOHDgQI+mroB//s52tU80NDSIHi8pKQkAcPLkyR7l6mqsoUOHmhtWNzc3Y8eOHVi7dq3o8YBb+0RxcTFycnIwduxY8x9zMWP9/ve/R2RkJIYNG9btTPcbr6mpCRMnTkRWVhYee+wxJCcno6SkBFFRUT0e6+LFi3B1dcXixYtx7tw5PPPMMz2arr/f37ULFy7g+PHjPfq719Xvrly5EnFxcXBzc8OwYcMQEhIiejxfX1+sW7cO9fX18PDwEHVbKMUdUXV2dsLOzs78WBCEux4rwZkzZ5CYmIilS5d2+/9aHiQtLQ1Hjx5FfX09iouLRY2xa9cuDBkyBAEBAVbnAYCnn34a69evx2OPPYaBAwdi9uzZOHTokKixTCYTjh49iuzsbBQVFaGurg67d++2OuPOnTutvqv0t99+i48++ggHDx7E4cOH0a9fP+Tn54se787b4iQlJcHf3x+Ojo5WZewN+0RDQwPmzZuHyMhITJgwwerxoqKicOzYMQwaNAibN28WNcaRI0dQX1+PyMhIq/MAt4723n33XXh4eMDFxQVxcXFW7RPV1dVYsmQJSktL0drairy8PKszFhUVITY2Fk5OTqLHaGxsxIYNG1BeXo7q6mr4+fn16H8+7vXkk0+abws1d+5c/PrXv+7xPqG4QuXp6YnGxkbz48bGRkmm7aRy8uRJJCQk4I033sBLL71k1Vhnz57FN998AwBwcXGBRqPB6dOnRY315z//GUeOHIFer8emTZtw4MABZGdni8524sQJHD161PxYEAQ4OIg7AB80aBACAgIwcOBAPPLII5g+fbpVR8nArSmdzz//HC+88IJV41RXVyMgIACPP/44nJycMGvWLBw/flz0eAaDARqNBnv37sX27dvh5OQELy8vqzIqfZ84e/YsoqOj8dJLL2HhwoVWjVVfX28+0nNwcIBWqxW9T5SXl+PMmTPQ6/XIzMzEV199ZdWR/OnTp7F//37zY2v3CT8/P3h5ecHe3h6hoaFW7xMAUFVVZfWdLE6cOAEfHx888cQT6NevH6KioqzaJ9ra2jB27Fjs2bMHO3fuxODBg3u8TyiuUD3//PM4evQompub0draik8//RRTpkx52LEA3NqJFi5ciA0bNkCr1Vo93uXLl5GZmQmj0Qij0Yiqqir4+/uLGmvbtm0oLy9HWVkZ0tLS8MILLyAjI0N0tuvX
r2P9+vVoa2uDwWDA7t27ezSHfqdp06ahuroaLS0tMJlMOHz4MEaNGiU6G3Drj8bw4cOtWtMDbk1L1NTU4MaNGxAEAQcOHMCYMWNEj9fVbXGsmZoEAD8/P5w/fx4XL16EyWRCeXm5YvYJg8GAV155BYsWLUJiYqLV412/fh1vvvkmWlpaIAgC9u/fL3qfWLt2Lfbt24eysjKsXr0ao0ePxjvvvCM6myAIyM7OxrVr19De3o6ioiLR+8SkSZPw9ddfm6fpDx48aPU+0dzcjJs3b1r9P0Y+Pj6oq6szT89VVVVZtU/cuHHD6ttCKW6NavDgwVi8eDHi4+PR3t6O2bNnY+zYsQ87FgAgPz8fbW1tWLdunXlbdHQ0YmJiRI0XFBSEuro6REREwN7eHhqNRpICKIVp06ahtrYWERER6OzsRGxsLJ5++mlRY/n5+SEpKQmxsbFob29HYGCg1dMxly5dgqenp1VjALf+YJw6dQqzZs2Co6MjxowZgwULFoge7363xbGGs7Mz1q1bh9TUVLS1tSEoKKhHawZyKikpQVNTE7Zt22a+vOKFF17AokWLRI3n4+ODBQsWIDo6Gvb29nj22Wetnt6Viq+vLxYsWICYmBh0dHRAo9EgPDxc1FhDhgzBqlWrkJKSgra2NowcORJvvfWWVfkuX74syT4xYsQILFq0CPHx8bC3t4e3t7dVl3/8y7/8i9W3heJtPoiISNEUN/VHRER0JxYqIiJSNBYqIiJSNBYqIiJSNBYqIiJSNBYqIiJSNBYqIiJSNBYqIiJStP8PqrwjibPfrbkAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate a larger synthetic dataset for demonstration\n", "np.random.seed(2020)\n", "npx = np.zeros((100,20))\n", "mask = np.random.choice([True, False], npx.shape, p=[.1, .9])\n", "npx[mask] = np.nan\n", "sns.heatmap(pd.DataFrame(npx).isnull(), cmap='viridis', cbar=False);" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Lecture Outline", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "305.797px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }