{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](../docs/banner.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 7: Introduction to Pandas" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Chapter Outline

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chapter Learning Objectives\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Create Pandas series with `pd.Series()` and Pandas dataframe with `pd.DataFrame()`\n", "- Be able to access values from a Series/DataFrame by indexing, slicing and boolean indexing using notation such as `df[]`, `df.loc[]`, `df.iloc[]`, `df.query[]`\n", "- Perform basic arithmetic operations between two series and anticipate the result.\n", "- Describe how Pandas assigns dtypes to Series and what the `object` dtype is\n", "- Read a standard .csv file from a local path or url using Pandas `pd.read_csv()`.\n", "- Explain the relationship and differences between `np.ndarray`, `pd.Series` and `pd.DataFrame` objects in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction to Pandas\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/chapter7/pandas.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is most popular Python library for tabular data structures. You can think of Pandas as an extremely powerful version of Excel (but free and with a lot more features!) \n", "\n", "Pandas can be installed using `conda`:\n", "\n", "```\n", "conda install pandas\n", "```\n", "\n", "We usually import pandas with the alias `pd`. You'll see these two imports at the top of most data science workflows:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Pandas Series\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are Series?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using `pd.Series()` (**note the captial \"S\"**). Here are some example series:\n", "\n", "![](img/chapter7/series.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, series are labelled with indices starting from 0. For example:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -5.0\n", "1 1.3\n", "2 21.0\n", "3 6.0\n", "4 3.0\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(data = [-5, 1.3, 21, 6, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But you can add a custom index:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a -5.0\n", "b 1.3\n", "c 21.0\n", "d 6.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(data = [-5, 1.3, 21, 6, 3],\n", " index = ['a', 'b', 'c', 'd', 'e'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a Series from a dictionary:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 10\n", "b 20\n", "c 30\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(data = {'a': 10, 'b': 20, 'c': 30})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or from an ndarray:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -0.428301\n", "1 -0.104959\n", "2 0.170835\n", "dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(data = np.random.randn(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or even a scalar:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3.141\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(3.141)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 3.141\n", "b 3.141\n", "c 3.141\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(data=3.141, index=['a', 'b', 'c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series Characteristics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series can be given a `name` attribute. I almost never use this but it might come up sometimes:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -1.363190\n", "1 0.425801\n", "2 -0.048966\n", "3 -0.298172\n", "4 1.899199\n", "Name: random_series, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series(data = np.random.randn(5), name='random_series')\n", "s" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'random_series'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.name" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -1.363190\n", "1 0.425801\n", "2 -0.048966\n", "3 -0.298172\n", "4 1.899199\n", "Name: another_name, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.rename(\"another_name\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the index labels of your series using the `.index` attribute:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the underlying data array using `.to_numpy()`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.36319006, 0.42580052, -0.04896627, -0.29817227, 1.89919866])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.to_numpy()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([list([1, 2, 3]), 'b', 1], dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([[1, 2, 3], \"b\", 1]).to_numpy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and Slicing Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series are very much like ndarrays (in fact, series can be passed to most NumPy functions!). They can be indexed using square brackets `[ ]` and sliced using colon `:` notation:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 2\n", "D 3\n", "E 4\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series(data = range(5),\n", " index = ['A', 'B', 'C', 'D', 'E'])\n", "s" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[0]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B 1\n", "C 2\n", "D 3\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[[1, 2, 3]]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 2\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note above how array-based indexing and slicing also returns the series index.\n", "\n", "Series are also like dictionaries, in that we can access values using index labels:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[\"A\"]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B 1\n", "D 3\n", "C 2\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[[\"B\", \"D\", \"C\"]]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 2\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[\"A\":\"C\"]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"A\" in s" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Z\" in s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series do allow for non-unique indexing, but **be careful** because indexing operations won't return unique values:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "A 1\n", "A 2\n", "B 3\n", "C 4\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.Series(data = range(5),\n", " index = [\"A\", \"A\", \"A\", \"B\", \"C\"])\n", "x" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "A 1\n", "A 2\n", "dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x[\"A\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can also do boolean indexing with series:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B 1\n", "C 2\n", "D 3\n", "E 4\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[s >= 1]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "D 3\n", "E 4\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[s > s.mean()]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A True\n", "B False\n", "C True\n", "D True\n", "E True\n", "dtype: bool" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(s != 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike ndarrays operations between Series (+, -, /, \\*) align values based on their **LABELS** (not their position in the structure). The resulting index will be the __*sorted union*__ of the two indexes. This gives you the flexibility to run operations on series regardless of their labels." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 2\n", "D 3\n", "dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 = pd.Series(data = range(4),\n", " index = [\"A\", \"B\", \"C\", \"D\"])\n", "s1" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "B 10\n", "C 11\n", "D 12\n", "E 13\n", "dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2 = pd.Series(data = range(10, 14),\n", " index = [\"B\", \"C\", \"D\", \"E\"])\n", "s2" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A NaN\n", "B 11.0\n", "C 13.0\n", "D 15.0\n", "E NaN\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see above, indices that match will be operated on. Indices that don't match will appear in the product but with `NaN` values:\n", "\n", "![](img/chapter7/series_addition.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also perform standard operations on a series, like multiplying or squaring. NumPy also accepts series as an argument to most functions because series are built off numpy arrays (more on that later):" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 4\n", "D 9\n", "dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 ** 2" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 1.000000\n", "B 2.718282\n", "C 7.389056\n", "D 20.085537\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(s1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, just like arrays, series have many built-in methods for various operations. You can find them all by running `help(pd.Series)`:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['T', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'cat', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dt', 'dtype', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'sparse', 'squeeze', 'std', 'str', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 'to_timestamp', 'to_xarray', 'tolist', 'transform', 'transpose', 'truediv', 'truncate', 'tshift', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']\n" ] } ], "source": [ "print([_ for _ in dir(pd.Series) if not _.startswith(\"_\")]) # print all common methods" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0\n", "B 1\n", "C 2\n", "D 3\n", "dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.5" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.mean()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.sum()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A 0.0\n", "B 1.0\n", "C 2.0\n", "D 3.0\n", "dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**\"Chaining\"** operations together is also common with pandas:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "22" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.add(3.141).pow(2).mean().astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series can hold all the data types (`dtypes`) you're used to, e.g., `int`, `float`, `bool`, etc. There are a few other special data types too (`object`, `DateTime` and `Categorical`) which we'll talk about in this and later chapters. You can always read more about pandas dtypes [in the documentation too](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes). For example, here's a series of `dtype` int64:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('int64')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.Series(range(5))\n", "x.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dtype \"`object`\" is used for series of strings or mixed data. Pandas is [currently experimenting](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html#pandas.StringDtype) with a dedicated string dtype `StringDtype`, but it is still in testing." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 A\n", "1 B\n", "dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.Series(['A', 'B'])\n", "x" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 A\n", "1 1\n", "2 [I, AM, A, LIST]\n", "dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.Series(['A', 1, [\"I\", \"AM\", \"A\", \"LIST\"]])\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While flexible, it is recommended to avoid the use of `object` dtypes because of higher memory requirements. Essentially, in an `object` dtype series, every single element stores information about its individual dtype. We can inspect the dtypes of all the elements in a mixed series in several ways, below I'll use the `map` method:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 \n", "1 \n", "2 \n", "dtype: object" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.map(type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that each object in our series has a different dtype. This comes at a cost. Compare the [memory usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.memory_usage.html) of the series below:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x1 dtype: int64\n", "x1 memory usage: 152 bytes\n", "\n", "x2 dtype: object\n", "x2 memory usage: 258 bytes\n", "\n", "x3 dtype: int8\n", "x3 memory usage: 131 bytes\n" ] } ], "source": [ "x1 = pd.Series([1, 2, 3])\n", "print(f\"x1 dtype: {x1.dtype}\")\n", "print(f\"x1 memory usage: {x1.memory_usage(deep=True)} bytes\")\n", "print(\"\")\n", "x2 = pd.Series([1, 2, \"3\"])\n", "print(f\"x2 dtype: {x2.dtype}\")\n", "print(f\"x2 memory usage: {x2.memory_usage(deep=True)} bytes\")\n", "print(\"\")\n", "x3 = pd.Series([1, 2, \"3\"]).astype('int8') # coerce the object series to int8\n", "print(f\"x3 dtype: {x3.dtype}\")\n", "print(f\"x3 memory usage: {x3.memory_usage(deep=True)} bytes\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, try to use uniform dtypes where possible - they are more memory efficient!\n", "\n", "One more gotcha, `NaN` (frequently used to represent missing values in data) is a float:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "float" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(np.NaN)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be problematic if you have a series of integers and one missing value, because Pandas will cast the whole series to a float:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN\n", "dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([1, 2, 3, np.NaN])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only recently, Pandas has implemented a \"[nullable integer dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html)\", which can handle `NaN` in an integer series without affecting the `dtype`. Note the captial \"I\" in the type below, differentiating it from numpy's `int64` dtype:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "dtype: Int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([1, 2, 3, np.NaN]).astype('Int64')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not the default in Pandas yet and functionality of this new feature is still subject to change." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Pandas DataFrames\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are DataFrames?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas DataFrames are you're new best friend. They are like the Excel spreadsheets you may be used to. DataFrames are really just Series stuck together! Think of a DataFrame as a dictionary of series, with the \"keys\" being the column labels and the \"values\" being the series data:\n", "\n", "![](img/chapter7/dataframe.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataframes can be created using `pd.DataFrame()` (note the capital \"D\" and \"F\"). Like series, index and column labels of dataframes are labelled starting from 0 by default:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0123
1456
2789
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 1 2 3\n", "1 4 5 6\n", "2 7 8 9" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `index` and `columns` arguments to give them labels:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
C1C2C3
R1123
R2456
R3789
\n", "
" ], "text/plain": [ " C1 C2 C3\n", "R1 1 2 3\n", "R2 4 5 6\n", "R3 7 8 9" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]],\n", " index = [\"R1\", \"R2\", \"R3\"],\n", " columns = [\"C1\", \"C2\", \"C3\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are so many ways to create dataframes. I most often create them from dictionaries or ndarrays:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
C1C2
R11A
R22B
R33C
\n", "
" ], "text/plain": [ " C1 C2\n", "R1 1 A\n", "R2 2 B\n", "R3 3 C" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({\"C1\": [1, 2, 3],\n", " \"C2\": ['A', 'B', 'C']},\n", " index=[\"R1\", \"R2\", \"R3\"])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col_1col_2col_3col_4col_5
row_1-1.511598-1.0738752.9904742.4080820.101569
row_20.7672460.423030-0.1354500.3695450.761417
row_30.7146771.4894300.843088-1.2846661.310033
row_4-0.5136560.5395310.2070570.4258880.481794
row_5-1.361988-0.4791580.158281-0.1968130.136745
\n", "
" ], "text/plain": [ " col_1 col_2 col_3 col_4 col_5\n", "row_1 -1.511598 -1.073875 2.990474 2.408082 0.101569\n", "row_2 0.767246 0.423030 -0.135450 0.369545 0.761417\n", "row_3 0.714677 1.489430 0.843088 -1.284666 1.310033\n", "row_4 -0.513656 0.539531 0.207057 0.425888 0.481794\n", "row_5 -1.361988 -0.479158 0.158281 -0.196813 0.136745" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.random.randn(5, 5),\n", " index=[f\"row_{_}\" for _ in range(1, 6)],\n", " columns=[f\"col_{_}\" for _ in range(1, 6)])" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
0Tom7
1Mike15
2Tiffany3
\n", "
" ], "text/plain": [ " 0 1\n", "0 Tom 7\n", "1 Mike 15\n", "2 Tiffany 3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a table of the main ways you can create dataframes (see the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) for more):\n", "\n", "|Create DataFrame from|Code|\n", "|---|---|\n", "|Lists of lists|`pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])`|\n", "|ndarray|`pd.DataFrame(np.array([['Tom', 7], ['Mike', 15], ['Tiffany', 3]]))`|\n", "|Dictionary|`pd.DataFrame({\"Name\": ['Tom', 'Mike', 'Tiffany'], \"Number\": [7, 15, 3]})`|\n", "|List of tuples|`pd.DataFrame(zip(['Tom', 'Mike', 'Tiffany'], [7, 15, 3]))`|\n", "|Series|`pd.DataFrame({\"Name\": pd.Series(['Tom', 'Mike', 'Tiffany']), \"Number\": pd.Series([7, 15, 3])})`|\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and Slicing DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several main ways to select data from a DataFrame:\n", "1. `[]`\n", "2. `.loc[]`\n", "3. `.iloc[]`\n", "4. Boolean indexing\n", "5. `.query()`" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
1MikePython4
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5\n", "1 Mike Python 4\n", "2 Tiffany R 7" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\"Name\": [\"Tom\", \"Mike\", \"Tiffany\"],\n", " \"Language\": [\"Python\", \"Python\", \"R\"],\n", " \"Courses\": [5, 4, 7]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Indexing with `[]`\n", "Select columns by single labels, lists of labels, or slices:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Tom\n", "1 Mike\n", "2 Tiffany\n", "Name: Name, dtype: object" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Name'] # returns a series" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Name
0Tom
1Mike
2Tiffany
\n", "
" ], "text/plain": [ " Name\n", "0 Tom\n", "1 Mike\n", "2 Tiffany" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Name']] # returns a dataframe!" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguage
0TomPython
1MikePython
2TiffanyR
\n", "
" ], "text/plain": [ " Name Language\n", "0 Tom Python\n", "1 Mike Python\n", "2 Tiffany R" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Name', 'Language']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can only index rows by using slices, not single values (but not recommended, see preferred methods below)." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "KeyError", "evalue": "0", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2888\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2889\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2890\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 0", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;31m# doesn't work\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2900\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2901\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2902\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2903\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2889\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2890\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2891\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2892\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2893\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 0" ] } ], "source": [ "df[0] # doesn't work" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[0:1] # does work" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
1MikePython4
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "1 Mike Python 4\n", "2 Tiffany R 7" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[1:] # does work" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Indexing with `.loc` and `.iloc`\n", "Pandas created the methods `.loc[]` and `.iloc[]` as more flexible alternatives for accessing data from a dataframe. Use `df.iloc[]` for indexing with integers and `df.loc[]` for indexing with labels. These are typically the [recommended methods of indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated) in Pandas." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
1MikePython4
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5\n", "1 Mike Python 4\n", "2 Tiffany R 7" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we'll try out `.iloc` which accepts *integers* as references to rows/columns:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name Tom\n", "Language Python\n", "Courses 5\n", "Name: 0, dtype: object" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0] # returns a series" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
1MikePython4
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5\n", "1 Mike Python 4" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0:2] # slicing returns a dataframe" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'R'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2, 1] # returns the indexed object" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LanguageCourses
0Python5
1Python4
\n", "
" ], "text/plain": [ " Language Courses\n", "0 Python 5\n", "1 Python 4" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[[0, 1], [1, 2]] # returns a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at `.loc` which accepts *labels* as references to rows/columns:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Tom\n", "1 Mike\n", "2 Tiffany\n", "Name: Name, dtype: object" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:, 'Name']" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguage
0TomPython
1MikePython
2TiffanyR
\n", "
" ], "text/plain": [ " Name Language\n", "0 Tom Python\n", "1 Mike Python\n", "2 Tiffany R" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:, 'Name':'Language']" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Language
0Python
2R
\n", "
" ], "text/plain": [ " Language\n", "0 Python\n", "2 R" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[[0, 2], ['Language']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we want to use a mix of integers and labels to reference data in a dataframe. The easiest way to do this is to use `.loc[]` with a label then use an integer in combinations with `.index` or `.columns`:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=3, step=1)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'Language', 'Courses'], dtype='object')" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df.index[0], 'Courses'] # I want to reference the first row and the column named \"Courses\"" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'R'" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2, df.columns[1]] # I want to reference row \"2\" and the second column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Boolean indexing\n", "Just like with series, we can select data based on boolean masks:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
1MikePython4
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5\n", "1 Mike Python 4\n", "2 Tiffany R 7" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "2 Tiffany R 7" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Courses'] > 5]" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Name'] == \"Tom\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Indexing with `.query()`\n", "Boolean masks work fine, but I prefer to use the `.query()` method for selecting data. `df.query()` is a powerful tool for filtering data. It has an odd syntax, one of the strangest I've seen in Python, it is more like SQL - `df.query()` accepts a string expression to evaluate and it \"knows\" the names of the columns in your dataframe." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"Courses > 4 & Language == 'Python'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the use of single quotes AND double quotes above, lucky we have both in Python! Compare this to the equivalent boolean indexing operation and you can see that `.query()` is much more readable, especially as the query gets bigger!" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df['Courses'] > 4) & (df['Language'] == 'Python')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Query also allows you to reference variable in the current workspace using the `@` symbol:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLanguageCourses
0TomPython5
2TiffanyR7
\n", "
" ], "text/plain": [ " Name Language Courses\n", "0 Tom Python 5\n", "2 Tiffany R 7" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "course_threshold = 4\n", "df.query(\"Courses > @course_threshold\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Indexing cheatsheet\n", "\n", "|Method|Syntax|Output|\n", "|---|---|---|\n", "|Select column|`df[col_label]`|Series|\n", "|Select row slice|`df[row_1_int:row_2_int]`|DataFrame|\n", "|Select row/column by label|`df.loc[row_label(s), col_label(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select row/column by integer|`df.iloc[row_int(s), col_int(s)]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by row integer & column label|`df.loc[df.index[row_int], col_label]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by row label & column integer|`df.loc[row_label, df.columns[col_int]]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by boolean|`df[bool_vec]`|Object for single selection, Series for one row/column, otherwise DataFrame|\n", "|Select by boolean expression|`df.query(\"expression\")`|Object for single selection, Series for one row/column, otherwise DataFrame|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading/Writing Data From External Sources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### .csv files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of the time you will be loading .csv files for use in pandas. You can use the `pd.read_csv()` function for this. In the following chapters we'll use a real dataset of my cycling commutes to the University of British Columbia. There are so many arguments that can be used to help read in your .csv file in an efficient and appropriate manner, feel free to check them out now (by using `shift + tab` in Jupyter, or typing `help(pd.read_csv)`)." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameTypeTimeDistanceComments
Date
2019-09-10 00:13:04Afternoon RideRide208412.62Rain
2019-09-10 13:52:18Morning RideRide253113.03rain
2019-09-11 00:23:50Afternoon RideRide186312.52Wet road but nice weather
2019-09-11 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
2019-09-12 00:28:05Afternoon RideRide189112.48Tired by the end of the week
2019-09-16 13:57:48Morning RideRide227212.45Rested after the weekend!
2019-09-17 00:15:47Afternoon RideRide197312.45Legs feeling strong!
2019-09-17 13:43:34Morning RideRide228512.60Raining
2019-09-18 13:49:53Morning RideRide290314.57Raining today
2019-09-18 00:15:52Afternoon RideRide210112.48Pumped up tires
2019-09-19 00:30:01Afternoon RideRide4806212.48Feeling good
2019-09-19 13:52:09Morning RideRide209012.59Getting colder which is nice
2019-09-20 01:02:05Afternoon RideRide296112.81Feeling good
2019-09-23 13:50:41Morning RideRide246212.68Rested after the weekend!
2019-09-24 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
2019-09-24 13:41:24Morning RideRide232112.68Bike feeling much smoother
2019-09-25 00:07:21Afternoon RideRide177512.10Feeling really tired
2019-09-25 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
2019-09-26 00:13:33Afternoon RideRide186012.52raining
2019-09-26 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
2019-09-27 01:00:18Afternoon RideRide171212.47Tired by the end of the week
2019-09-30 13:53:52Morning RideRide211812.71Rested after the weekend!
2019-10-01 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
2019-10-01 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
2019-10-02 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
2019-10-02 13:46:06Morning RideRide213413.06Bit tired today but good weather
2019-10-03 00:45:22Afternoon RideRide172412.52Feeling good
2019-10-03 13:47:36Morning RideRide218212.68Wet road
2019-10-04 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
2019-10-09 13:55:40Morning RideRide214912.70Really cold! But feeling good
2019-10-10 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
2019-10-10 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
2019-10-11 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "
" ], "text/plain": [ " Name Type Time Distance \\\n", "Date \n", "2019-09-10 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "2019-09-10 13:52:18 Morning Ride Ride 2531 13.03 \n", "2019-09-11 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "2019-09-11 14:06:19 Morning Ride Ride 2192 12.84 \n", "2019-09-12 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "2019-09-16 13:57:48 Morning Ride Ride 2272 12.45 \n", "2019-09-17 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "2019-09-17 13:43:34 Morning Ride Ride 2285 12.60 \n", "2019-09-18 13:49:53 Morning Ride Ride 2903 14.57 \n", "2019-09-18 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "2019-09-19 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "2019-09-19 13:52:09 Morning Ride Ride 2090 12.59 \n", "2019-09-20 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "2019-09-23 13:50:41 Morning Ride Ride 2462 12.68 \n", "2019-09-24 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "2019-09-24 13:41:24 Morning Ride Ride 2321 12.68 \n", "2019-09-25 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "2019-09-25 13:35:41 Morning Ride Ride 2124 12.65 \n", "2019-09-26 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "2019-09-26 13:42:43 Morning Ride Ride 2350 12.91 \n", "2019-09-27 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "2019-09-30 13:53:52 Morning Ride Ride 2118 12.71 \n", "2019-10-01 00:15:07 Afternoon Ride Ride 1732 NaN \n", "2019-10-01 13:45:55 Morning Ride Ride 2222 12.82 \n", "2019-10-02 00:13:09 Afternoon Ride Ride 1756 NaN \n", "2019-10-02 13:46:06 Morning Ride Ride 2134 13.06 \n", "2019-10-03 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "2019-10-03 13:47:36 Morning Ride Ride 2182 12.68 \n", "2019-10-04 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "2019-10-09 13:55:40 Morning Ride Ride 2149 12.70 \n", "2019-10-10 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "2019-10-10 13:47:14 Morning Ride Ride 2463 12.79 \n", "2019-10-11 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "Date \n", "2019-09-10 00:13:04 Rain \n", "2019-09-10 13:52:18 rain \n", "2019-09-11 00:23:50 Wet road but nice weather \n", "2019-09-11 14:06:19 Stopped for photo of sunrise \n", "2019-09-12 00:28:05 Tired by the end of the week \n", "2019-09-16 13:57:48 Rested after the weekend! \n", "2019-09-17 00:15:47 Legs feeling strong! \n", "2019-09-17 13:43:34 Raining \n", "2019-09-18 13:49:53 Raining today \n", "2019-09-18 00:15:52 Pumped up tires \n", "2019-09-19 00:30:01 Feeling good \n", "2019-09-19 13:52:09 Getting colder which is nice \n", "2019-09-20 01:02:05 Feeling good \n", "2019-09-23 13:50:41 Rested after the weekend! \n", "2019-09-24 00:35:42 Oiled chain, bike feels smooth \n", "2019-09-24 13:41:24 Bike feeling much smoother \n", "2019-09-25 00:07:21 Feeling really tired \n", "2019-09-25 13:35:41 Stopped for photo of sunrise \n", "2019-09-26 00:13:33 raining \n", "2019-09-26 13:42:43 Detour around trucks at Jericho \n", "2019-09-27 01:00:18 Tired by the end of the week \n", "2019-09-30 13:53:52 Rested after the weekend! \n", "2019-10-01 00:15:07 Legs feeling strong! \n", "2019-10-01 13:45:55 Beautiful morning! Feeling fit \n", "2019-10-02 00:13:09 A little tired today but good weather \n", "2019-10-02 13:46:06 Bit tired today but good weather \n", "2019-10-03 00:45:22 Feeling good \n", "2019-10-03 13:47:36 Wet road \n", "2019-10-04 01:08:08 Very tired, riding into the wind \n", "2019-10-09 13:55:40 Really cold! But feeling good \n", "2019-10-10 00:10:31 Feeling good after a holiday break! \n", "2019-10-10 13:47:14 Stopped for photo of sunrise \n", "2019-10-11 00:16:57 Bike feeling tight, needs an oil and pump " ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = 'data/cycling_data.csv'\n", "df = pd.read_csv(path, index_col=0, parse_dates=True)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can print a dataframe to .csv using `df.to_csv()`. Be sure to check out all of the possible arguments to write your dataframe exactly how you want it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas also facilitates reading directly from a url - `pd.read_csv()` accepts urls as input:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BottleGrapeOriginAlcoholpHColourAroma
01ChardonnayAustralia14.233.51WhiteFloral
12Pinot GrigioItaly13.203.30WhiteFruity
23Pinot BlancFrance13.163.16WhiteCitrus
34ShirazChile14.913.39RedBerry
45MalbecArgentina13.833.28RedFruity
\n", "
" ], "text/plain": [ " Bottle Grape Origin Alcohol pH Colour Aroma\n", "0 1 Chardonnay Australia 14.23 3.51 White Floral\n", "1 2 Pinot Grigio Italy 13.20 3.30 White Fruity\n", "2 3 Pinot Blanc France 13.16 3.16 White Citrus\n", "3 4 Shiraz Chile 14.91 3.39 Red Berry\n", "4 5 Malbec Argentina 13.83 3.28 Red Fruity" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://raw.githubusercontent.com/TomasBeuzen/toy-datasets/master/wine_1.csv'\n", "pd.read_csv(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Other\n", "Pandas can read data from all sorts of other file types including HTML, JSON, Excel, Parquet, Feather, etc. There are generally dedicated functions for reading these file types, see the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Common DataFrame Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames have built-in functions for performing most common operations, e.g., `.min()`, `idxmin()`, `sort_values()`, etc. They're all documented in the [Pandas documentation here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) but I'll demonstrate a few below:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
1119 Sep 2019, 13:52:09Morning RideRide209012.59Getting colder which is nice
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
1323 Sep 2019, 13:50:41Morning RideRide246212.68Rested after the weekend!
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
1524 Sep 2019, 13:41:24Morning RideRide232112.68Bike feeling much smoother
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
1725 Sep 2019, 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
1926 Sep 2019, 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
2130 Sep 2019, 13:53:52Morning RideRide211812.71Rested after the weekend!
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
231 Oct 2019, 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "11 19 Sep 2019, 13:52:09 Morning Ride Ride 2090 12.59 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "13 23 Sep 2019, 13:50:41 Morning Ride Ride 2462 12.68 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "15 24 Sep 2019, 13:41:24 Morning Ride Ride 2321 12.68 \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "17 25 Sep 2019, 13:35:41 Morning Ride Ride 2124 12.65 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 \n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "21 30 Sep 2019, 13:53:52 Morning Ride Ride 2118 12.71 \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "23 1 Oct 2019, 13:45:55 Morning Ride Ride 2222 12.82 \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "\n", " Comments \n", "0 Rain \n", "1 rain \n", "2 Wet road but nice weather \n", "3 Stopped for photo of sunrise \n", "4 Tired by the end of the week \n", "5 Rested after the weekend! \n", "6 Legs feeling strong! \n", "7 Raining \n", "8 Raining today \n", "9 Pumped up tires \n", "10 Feeling good \n", "11 Getting colder which is nice \n", "12 Feeling good \n", "13 Rested after the weekend! \n", "14 Oiled chain, bike feels smooth \n", "15 Bike feeling much smoother \n", "16 Feeling really tired \n", "17 Stopped for photo of sunrise \n", "18 raining \n", "19 Detour around trucks at Jericho \n", "20 Tired by the end of the week \n", "21 Rested after the weekend! \n", "22 Legs feeling strong! \n", "23 Beautiful morning! Feeling fit \n", "24 A little tired today but good weather \n", "25 Bit tired today but good weather \n", "26 Feeling good \n", "27 Wet road \n", "28 Very tired, riding into the wind \n", "29 Really cold! But feeling good \n", "30 Feeling good after a holiday break! \n", "31 Stopped for photo of sunrise \n", "32 Bike feeling tight, needs an oil and pump " ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/cycling_data.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date 1 Oct 2019, 00:15:07\n", "Name Afternoon Ride\n", "Type Ride\n", "Time 1712\n", "Distance 11.79\n", "Comments A little tired today but good weather\n", "dtype: object" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1712" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Time'].min()" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Time'].idxmin()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date 27 Sep 2019, 01:00:18\n", "Name Afternoon Ride\n", "Type Ride\n", "Time 1712\n", "Distance 12.47\n", "Comments Tired by the end of the week\n", "Name: 20, dtype: object" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[20]" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date 10 Sep 2019, 00:13:0410 Sep 2019, 13:52:1811 S...\n", "Name Afternoon RideMorning RideAfternoon RideMornin...\n", "Type RideRideRideRideRideRideRideRideRideRideRideRi...\n", "Time 115922\n", "Distance 392.69\n", "Comments RainrainWet road but nice weatherStopped for p...\n", "dtype: object" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some methods like `.mean()` will only operate on numeric columns:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Time 3512.787879\n", "Distance 12.667419\n", "dtype: float64" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some methods require arguments to be specified, like `.sort_values()`:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
1119 Sep 2019, 13:52:09Morning RideRide209012.59Getting colder which is nice
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
2130 Sep 2019, 13:53:52Morning RideRide211812.71Rested after the weekend!
1725 Sep 2019, 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
231 Oct 2019, 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
1524 Sep 2019, 13:41:24Morning RideRide232112.68Bike feeling much smoother
1926 Sep 2019, 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
1323 Sep 2019, 13:50:41Morning RideRide246212.68Rested after the weekend!
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "11 19 Sep 2019, 13:52:09 Morning Ride Ride 2090 12.59 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "21 30 Sep 2019, 13:53:52 Morning Ride Ride 2118 12.71 \n", "17 25 Sep 2019, 13:35:41 Morning Ride Ride 2124 12.65 \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "23 1 Oct 2019, 13:45:55 Morning Ride Ride 2222 12.82 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "15 24 Sep 2019, 13:41:24 Morning Ride Ride 2321 12.68 \n", "19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 \n", "13 23 Sep 2019, 13:50:41 Morning Ride Ride 2462 12.68 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "\n", " Comments \n", "20 Tired by the end of the week \n", "26 Feeling good \n", "22 Legs feeling strong! \n", "24 A little tired today but good weather \n", "16 Feeling really tired \n", "30 Feeling good after a holiday break! \n", "32 Bike feeling tight, needs an oil and pump \n", "18 raining \n", "2 Wet road but nice weather \n", "28 Very tired, riding into the wind \n", "4 Tired by the end of the week \n", "6 Legs feeling strong! \n", "14 Oiled chain, bike feels smooth \n", "0 Rain \n", "11 Getting colder which is nice \n", "9 Pumped up tires \n", "21 Rested after the weekend! \n", "17 Stopped for photo of sunrise \n", "25 Bit tired today but good weather \n", "29 Really cold! But feeling good \n", "27 Wet road \n", "3 Stopped for photo of sunrise \n", "23 Beautiful morning! Feeling fit \n", "5 Rested after the weekend! \n", "7 Raining \n", "15 Bike feeling much smoother \n", "19 Detour around trucks at Jericho \n", "13 Rested after the weekend! \n", "31 Stopped for photo of sunrise \n", "1 rain \n", "8 Raining today \n", "12 Feeling good \n", "10 Feeling good " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='Time')" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
1323 Sep 2019, 13:50:41Morning RideRide246212.68Rested after the weekend!
1926 Sep 2019, 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
1524 Sep 2019, 13:41:24Morning RideRide232112.68Bike feeling much smoother
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
231 Oct 2019, 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
1725 Sep 2019, 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
2130 Sep 2019, 13:53:52Morning RideRide211812.71Rested after the weekend!
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
1119 Sep 2019, 13:52:09Morning RideRide209012.59Getting colder which is nice
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "13 23 Sep 2019, 13:50:41 Morning Ride Ride 2462 12.68 \n", "19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 \n", "15 24 Sep 2019, 13:41:24 Morning Ride Ride 2321 12.68 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "23 1 Oct 2019, 13:45:55 Morning Ride Ride 2222 12.82 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "17 25 Sep 2019, 13:35:41 Morning Ride Ride 2124 12.65 \n", "21 30 Sep 2019, 13:53:52 Morning Ride Ride 2118 12.71 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "11 19 Sep 2019, 13:52:09 Morning Ride Ride 2090 12.59 \n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "\n", " Comments \n", "10 Feeling good \n", "12 Feeling good \n", "8 Raining today \n", "1 rain \n", "31 Stopped for photo of sunrise \n", "13 Rested after the weekend! \n", "19 Detour around trucks at Jericho \n", "15 Bike feeling much smoother \n", "7 Raining \n", "5 Rested after the weekend! \n", "23 Beautiful morning! Feeling fit \n", "3 Stopped for photo of sunrise \n", "27 Wet road \n", "29 Really cold! But feeling good \n", "25 Bit tired today but good weather \n", "17 Stopped for photo of sunrise \n", "21 Rested after the weekend! \n", "9 Pumped up tires \n", "11 Getting colder which is nice \n", "0 Rain \n", "14 Oiled chain, bike feels smooth \n", "6 Legs feeling strong! \n", "4 Tired by the end of the week \n", "28 Very tired, riding into the wind \n", "2 Wet road but nice weather \n", "18 raining \n", "32 Bike feeling tight, needs an oil and pump \n", "30 Feeling good after a holiday break! \n", "16 Feeling really tired \n", "24 A little tired today but good weather \n", "22 Legs feeling strong! \n", "26 Feeling good \n", "20 Tired by the end of the week " ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='Time', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some methods will operate on the index/columns, like `.sort_index()`:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateNameTypeTimeDistanceComments
3211 Oct 2019, 00:16:57Afternoon RideRide184311.79Bike feeling tight, needs an oil and pump
3110 Oct 2019, 13:47:14Morning RideRide246312.79Stopped for photo of sunrise
3010 Oct 2019, 00:10:31Afternoon RideRide184112.59Feeling good after a holiday break!
299 Oct 2019, 13:55:40Morning RideRide214912.70Really cold! But feeling good
284 Oct 2019, 01:08:08Afternoon RideRide187012.63Very tired, riding into the wind
273 Oct 2019, 13:47:36Morning RideRide218212.68Wet road
263 Oct 2019, 00:45:22Afternoon RideRide172412.52Feeling good
252 Oct 2019, 13:46:06Morning RideRide213413.06Bit tired today but good weather
242 Oct 2019, 00:13:09Afternoon RideRide1756NaNA little tired today but good weather
231 Oct 2019, 13:45:55Morning RideRide222212.82Beautiful morning! Feeling fit
221 Oct 2019, 00:15:07Afternoon RideRide1732NaNLegs feeling strong!
2130 Sep 2019, 13:53:52Morning RideRide211812.71Rested after the weekend!
2027 Sep 2019, 01:00:18Afternoon RideRide171212.47Tired by the end of the week
1926 Sep 2019, 13:42:43Morning RideRide235012.91Detour around trucks at Jericho
1826 Sep 2019, 00:13:33Afternoon RideRide186012.52raining
1725 Sep 2019, 13:35:41Morning RideRide212412.65Stopped for photo of sunrise
1625 Sep 2019, 00:07:21Afternoon RideRide177512.10Feeling really tired
1524 Sep 2019, 13:41:24Morning RideRide232112.68Bike feeling much smoother
1424 Sep 2019, 00:35:42Afternoon RideRide207612.47Oiled chain, bike feels smooth
1323 Sep 2019, 13:50:41Morning RideRide246212.68Rested after the weekend!
1220 Sep 2019, 01:02:05Afternoon RideRide296112.81Feeling good
1119 Sep 2019, 13:52:09Morning RideRide209012.59Getting colder which is nice
1019 Sep 2019, 00:30:01Afternoon RideRide4806212.48Feeling good
918 Sep 2019, 00:15:52Afternoon RideRide210112.48Pumped up tires
818 Sep 2019, 13:49:53Morning RideRide290314.57Raining today
717 Sep 2019, 13:43:34Morning RideRide228512.60Raining
617 Sep 2019, 00:15:47Afternoon RideRide197312.45Legs feeling strong!
516 Sep 2019, 13:57:48Morning RideRide227212.45Rested after the weekend!
412 Sep 2019, 00:28:05Afternoon RideRide189112.48Tired by the end of the week
311 Sep 2019, 14:06:19Morning RideRide219212.84Stopped for photo of sunrise
211 Sep 2019, 00:23:50Afternoon RideRide186312.52Wet road but nice weather
110 Sep 2019, 13:52:18Morning RideRide253113.03rain
010 Sep 2019, 00:13:04Afternoon RideRide208412.62Rain
\n", "
" ], "text/plain": [ " Date Name Type Time Distance \\\n", "32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 \n", "31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 \n", "30 10 Oct 2019, 00:10:31 Afternoon Ride Ride 1841 12.59 \n", "29 9 Oct 2019, 13:55:40 Morning Ride Ride 2149 12.70 \n", "28 4 Oct 2019, 01:08:08 Afternoon Ride Ride 1870 12.63 \n", "27 3 Oct 2019, 13:47:36 Morning Ride Ride 2182 12.68 \n", "26 3 Oct 2019, 00:45:22 Afternoon Ride Ride 1724 12.52 \n", "25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 \n", "24 2 Oct 2019, 00:13:09 Afternoon Ride Ride 1756 NaN \n", "23 1 Oct 2019, 13:45:55 Morning Ride Ride 2222 12.82 \n", "22 1 Oct 2019, 00:15:07 Afternoon Ride Ride 1732 NaN \n", "21 30 Sep 2019, 13:53:52 Morning Ride Ride 2118 12.71 \n", "20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 \n", "19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 \n", "18 26 Sep 2019, 00:13:33 Afternoon Ride Ride 1860 12.52 \n", "17 25 Sep 2019, 13:35:41 Morning Ride Ride 2124 12.65 \n", "16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 \n", "15 24 Sep 2019, 13:41:24 Morning Ride Ride 2321 12.68 \n", "14 24 Sep 2019, 00:35:42 Afternoon Ride Ride 2076 12.47 \n", "13 23 Sep 2019, 13:50:41 Morning Ride Ride 2462 12.68 \n", "12 20 Sep 2019, 01:02:05 Afternoon Ride Ride 2961 12.81 \n", "11 19 Sep 2019, 13:52:09 Morning Ride Ride 2090 12.59 \n", "10 19 Sep 2019, 00:30:01 Afternoon Ride Ride 48062 12.48 \n", "9 18 Sep 2019, 00:15:52 Afternoon Ride Ride 2101 12.48 \n", "8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 \n", "7 17 Sep 2019, 13:43:34 Morning Ride Ride 2285 12.60 \n", "6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 \n", "5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 \n", "4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 \n", "3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 \n", "2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 \n", "1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 \n", "0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 \n", "\n", " Comments \n", "32 Bike feeling tight, needs an oil and pump \n", "31 Stopped for photo of sunrise \n", "30 Feeling good after a holiday break! \n", "29 Really cold! But feeling good \n", "28 Very tired, riding into the wind \n", "27 Wet road \n", "26 Feeling good \n", "25 Bit tired today but good weather \n", "24 A little tired today but good weather \n", "23 Beautiful morning! Feeling fit \n", "22 Legs feeling strong! \n", "21 Rested after the weekend! \n", "20 Tired by the end of the week \n", "19 Detour around trucks at Jericho \n", "18 raining \n", "17 Stopped for photo of sunrise \n", "16 Feeling really tired \n", "15 Bike feeling much smoother \n", "14 Oiled chain, bike feels smooth \n", "13 Rested after the weekend! \n", "12 Feeling good \n", "11 Getting colder which is nice \n", "10 Feeling good \n", "9 Pumped up tires \n", "8 Raining today \n", "7 Raining \n", "6 Legs feeling strong! \n", "5 Rested after the weekend! \n", "4 Tired by the end of the week \n", "3 Stopped for photo of sunrise \n", "2 Wet road but nice weather \n", "1 rain \n", "0 Rain " ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Why ndarrays and Series and DataFrames?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, you might be asking why we need all these different data structures. Well, they all serve different purposes and are suited to different tasks. For example:\n", "- NumPy is typically faster/uses less memory than Pandas;\n", "- not all Python packages are compatible with NumPy & Pandas;\n", "- the ability to add labels to data can be useful (e.g., for time series);\n", "- NumPy and Pandas have different built-in functions available.\n", "\n", "My advice: use the simplest data structure that fulfills your needs!\n", "\n", "Finally, we've seen how to go from: ndarray (`np.array()`) -> series (`pd.series()`) -> dataframe (`pd.DataFrame()`). Remember that we can also go the other way: dataframe/series -> ndarray using `df.to_numpy()`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Lecture Outline", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "305.797px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }