Basic Wrangling With Pandas

Tomas Beuzen, September 2020

These exercises complement Chapter 8.

Exercises

1.

In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Let’s start by importing pandas with the alias pd.

# Your answer here.

2.

As a reminder, the dataset has the following columns:

column

description

country

Country Name

food_category

Food Category

consumption

Consumption (kg/person/year)

co2_emmission

Co2 Emission (Kg CO2/person/year)

Import the dataset as a dataframe named df from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv

# Your answer here.

3.

What country consumes the most food per person per year (across all food categories)?

# Your answer here.

4.

Which food category is the biggest contributor to the above country’s consumption total?

# Your answer here.

5.

What country produces the most kg C02 per person per year?

# Your answer here.

6.

Which food category is the biggest contributor to the above country’s C02 emissions?

# Your answer here.

7.

What food category produces the most C02 per person per year across all countries?

# Your answer here.

8.

What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?

# Your answer here.

9.

Make the dataset wide by pivoting on the food_category column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.

# Your answer here.

10.

Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission'], if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.

# Your answer here.



Solutions

1.

In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Let’s start by importing pandas with the alias pd.

import pandas as pd

2.

As a reminder, the dataset has the following columns:

column

description

country

Country Name

food_category

Food Category

consumption

Consumption (kg/person/year)

co2_emmission

Co2 Emission (Kg CO2/person/year)

Import the dataset as a dataframe named df from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df
country food_category consumption co2_emmission
0 Argentina Pork 10.51 37.20
1 Argentina Poultry 38.66 41.53
2 Argentina Beef 55.48 1712.00
3 Argentina Lamb & Goat 1.56 54.63
4 Argentina Fish 4.36 6.96
... ... ... ... ...
1425 Bangladesh Milk - inc. cheese 21.91 31.21
1426 Bangladesh Wheat and Wheat Products 17.47 3.33
1427 Bangladesh Rice 171.73 219.76
1428 Bangladesh Soybeans 0.61 0.27
1429 Bangladesh Nuts inc. Peanut Butter 0.72 1.27

1430 rows × 4 columns

3.

What country consumes the most food per person per year (across all food categories)?

df.groupby("country").sum().sort_values(by="consumption").tail(1)
consumption co2_emmission
country
Finland 639.79 1464.63

4.

Which food category is the biggest contributor to the above country’s consumption total?

df.query("country == 'Finland'").sort_values(by="consumption").tail(1)
country food_category consumption co2_emmission
149 Finland Milk - inc. cheese 430.76 613.57

5.

What country produces the most kg C02 per person per year?

df.groupby("country").sum().sort_values(by="co2_emmission").tail(1)
consumption co2_emmission
country
Argentina 429.41 2172.4

6.

Which food category is the biggest contributor to the above country’s C02 emissions?

df.query("country == 'Argentina'").sort_values(by="co2_emmission").tail(1)
country food_category consumption co2_emmission
2 Argentina Beef 55.48 1712.0

7.

What food category produces the most C02 per person per year across all countries?

df.groupby("food_category").sum().sort_values(by="co2_emmission", ascending=False)
consumption co2_emmission
food_category
Beef 1576.04 48633.26
Milk - inc. cheese 16350.71 23290.00
Lamb & Goat 338.02 11837.38
Pork 2096.08 7419.11
Rice 3818.77 4886.91
Fish 2247.32 3588.22
Poultry 2758.50 2963.16
Wheat and Wheat Products 9301.44 1773.78
Eggs 1061.29 974.95
Nuts inc. Peanut Butter 537.84 951.99
Soybeans 111.87 50.35

8.

What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?

print("Most consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).head(1))
print("")
print("Least consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).tail(1))
Most consumption:
                    consumption  co2_emmission
food_category                                 
Milk - inc. cheese     16350.71        23290.0

Least consumption:
               consumption  co2_emmission
food_category                            
Soybeans            111.87          50.35

9.

Make the dataset wide by pivoting on the food_category column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.

df = df.pivot(index='country', columns='food_category')
df
consumption ... co2_emmission
food_category Beef Eggs Fish Lamb & Goat Milk - inc. cheese Nuts inc. Peanut Butter Pork Poultry Rice Soybeans ... Eggs Fish Lamb & Goat Milk - inc. cheese Nuts inc. Peanut Butter Pork Poultry Rice Soybeans Wheat and Wheat Products
country
Albania 22.50 12.45 3.85 15.32 303.72 4.36 10.88 13.23 7.78 0.00 ... 11.44 6.15 536.50 432.62 7.72 38.51 14.21 9.96 0.00 26.44
Algeria 5.60 8.06 3.74 7.69 141.53 2.08 0.00 7.42 2.97 0.00 ... 7.40 5.97 269.30 201.60 3.68 0.00 7.97 3.80 0.00 35.36
Angola 8.42 1.11 15.24 1.08 12.30 2.26 8.89 17.33 8.12 0.52 ... 1.02 24.33 37.82 17.52 4.00 31.47 18.62 10.39 0.23 7.77
Argentina 55.48 11.39 4.36 1.56 195.08 0.49 10.51 38.66 8.77 0.00 ... 10.46 6.96 54.63 277.87 0.87 37.20 41.53 11.22 0.00 19.66
Armenia 19.66 11.69 4.36 3.02 209.03 2.55 9.67 13.35 3.18 0.00 ... 10.74 6.96 105.76 297.74 4.51 34.23 14.34 4.07 0.00 24.91
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay 29.10 13.14 6.53 8.23 210.54 0.95 16.84 27.45 11.50 0.01 ... 12.07 10.43 288.21 299.89 1.68 59.61 29.49 14.72 0.00 20.85
Venezuela 25.89 5.63 8.34 0.32 117.79 0.35 7.23 39.28 23.39 0.00 ... 5.17 13.32 11.21 167.78 0.62 25.59 42.19 29.93 0.00 9.38
Vietnam 7.44 3.84 26.52 0.14 16.36 6.28 35.00 12.36 144.56 5.75 ... 3.53 42.34 4.90 23.30 11.12 123.88 13.28 184.99 2.59 2.00
Zambia 4.76 3.32 6.20 0.68 9.71 5.04 1.66 3.29 3.05 7.30 ... 3.05 9.90 23.81 13.83 8.92 5.88 3.53 3.90 3.29 2.31
Zimbabwe 7.37 1.75 2.80 0.96 31.90 2.22 2.65 4.97 10.09 0.54 ... 1.61 4.47 33.62 45.44 3.93 9.38 5.34 12.91 0.24 5.97

130 rows × 22 columns

10.

Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission'], if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.

df["consumption"].sum(axis=1).sort_values().tail(1)
country
Finland    639.79
dtype: float64