Basic Wrangling With Pandas¶

Tomas Beuzen, September 2020

These exercises complement Chapter 8.

Exercises¶

1.¶

In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Let’s start by importing pandas with the alias pd.

# Your answer here.

2.¶

As a reminder, the dataset has the following columns:

column	description
country	Country Name
food_category	Food Category
consumption	Consumption (kg/person/year)
co2_emmission	Co2 Emission (Kg CO2/person/year)

Import the dataset as a dataframe named df from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv

# Your answer here.

3.¶

What country consumes the most food per person per year (across all food categories)?

# Your answer here.

4.¶

Which food category is the biggest contributor to the above country’s consumption total?

# Your answer here.

5.¶

What country produces the most kg C02 per person per year?

# Your answer here.

6.¶

Which food category is the biggest contributor to the above country’s C02 emissions?

# Your answer here.

7.¶

What food category produces the most C02 per person per year across all countries?

# Your answer here.

8.¶

What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?

# Your answer here.

9.¶

Make the dataset wide by pivoting on the food_category column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.

# Your answer here.

10.¶

Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission'], if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.

# Your answer here.

Solutions¶

1.¶

In this set of practice exercises we’ll be again looking at the dataset of consumption and carbon footprints of different foods that we looked at in the last set of practice exercises, which was compiled by Kasia Kulma and contributed to R’s Tidy Tuesday project.

Let’s start by importing pandas with the alias pd.

import pandas as pd

2.¶

As a reminder, the dataset has the following columns:

column	description
country	Country Name
food_category	Food Category
consumption	Consumption (kg/person/year)
co2_emmission	Co2 Emission (Kg CO2/person/year)

Import the dataset as a dataframe named df from this url: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df

	country	food_category	consumption	co2_emmission
0	Argentina	Pork	10.51	37.20
1	Argentina	Poultry	38.66	41.53
2	Argentina	Beef	55.48	1712.00
3	Argentina	Lamb & Goat	1.56	54.63
4	Argentina	Fish	4.36	6.96
...	...	...	...	...
1425	Bangladesh	Milk - inc. cheese	21.91	31.21
1426	Bangladesh	Wheat and Wheat Products	17.47	3.33
1427	Bangladesh	Rice	171.73	219.76
1428	Bangladesh	Soybeans	0.61	0.27
1429	Bangladesh	Nuts inc. Peanut Butter	0.72	1.27

1430 rows × 4 columns

3.¶

What country consumes the most food per person per year (across all food categories)?

df.groupby("country").sum().sort_values(by="consumption").tail(1)

	consumption	co2_emmission
country
Finland	639.79	1464.63

4.¶

Which food category is the biggest contributor to the above country’s consumption total?

df.query("country == 'Finland'").sort_values(by="consumption").tail(1)

	country	food_category	consumption	co2_emmission
149	Finland	Milk - inc. cheese	430.76	613.57

5.¶

What country produces the most kg C02 per person per year?

df.groupby("country").sum().sort_values(by="co2_emmission").tail(1)

	consumption	co2_emmission
country
Argentina	429.41	2172.4

6.¶

Which food category is the biggest contributor to the above country’s C02 emissions?

df.query("country == 'Argentina'").sort_values(by="co2_emmission").tail(1)

	country	food_category	consumption	co2_emmission
2	Argentina	Beef	55.48	1712.0

7.¶

What food category produces the most C02 per person per year across all countries?

df.groupby("food_category").sum().sort_values(by="co2_emmission", ascending=False)

	consumption	co2_emmission
food_category
Beef	1576.04	48633.26
Milk - inc. cheese	16350.71	23290.00
Lamb & Goat	338.02	11837.38
Pork	2096.08	7419.11
Rice	3818.77	4886.91
Fish	2247.32	3588.22
Poultry	2758.50	2963.16
Wheat and Wheat Products	9301.44	1773.78
Eggs	1061.29	974.95
Nuts inc. Peanut Butter	537.84	951.99
Soybeans	111.87	50.35

8.¶

What food category is consumed the most across all countries per person per year? What food category is consumed the least across all countries?

print("Most consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).head(1))
print("")
print("Least consumption:")
print(df.groupby("food_category").sum().sort_values(by="consumption", ascending=False).tail(1))

Most consumption:
                    consumption  co2_emmission
food_category                                 
Milk - inc. cheese     16350.71        23290.0

Least consumption:
               consumption  co2_emmission
food_category                            
Soybeans            111.87          50.35

9.¶

Make the dataset wide by pivoting on the food_category column. You’ll end up with a “multi-index” dataframe, with multiple levels of columns.

df = df.pivot(index='country', columns='food_category')
df

	consumption										...	co2_emmission
food_category	Beef	Eggs	Fish	Lamb & Goat	Milk - inc. cheese	Nuts inc. Peanut Butter	Pork	Poultry	Rice	Soybeans	...	Eggs	Fish	Lamb & Goat	Milk - inc. cheese	Nuts inc. Peanut Butter	Pork	Poultry	Rice	Soybeans	Wheat and Wheat Products
country
Albania	22.50	12.45	3.85	15.32	303.72	4.36	10.88	13.23	7.78	0.00	...	11.44	6.15	536.50	432.62	7.72	38.51	14.21	9.96	0.00	26.44
Algeria	5.60	8.06	3.74	7.69	141.53	2.08	0.00	7.42	2.97	0.00	...	7.40	5.97	269.30	201.60	3.68	0.00	7.97	3.80	0.00	35.36
Angola	8.42	1.11	15.24	1.08	12.30	2.26	8.89	17.33	8.12	0.52	...	1.02	24.33	37.82	17.52	4.00	31.47	18.62	10.39	0.23	7.77
Argentina	55.48	11.39	4.36	1.56	195.08	0.49	10.51	38.66	8.77	0.00	...	10.46	6.96	54.63	277.87	0.87	37.20	41.53	11.22	0.00	19.66
Armenia	19.66	11.69	4.36	3.02	209.03	2.55	9.67	13.35	3.18	0.00	...	10.74	6.96	105.76	297.74	4.51	34.23	14.34	4.07	0.00	24.91
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Uruguay	29.10	13.14	6.53	8.23	210.54	0.95	16.84	27.45	11.50	0.01	...	12.07	10.43	288.21	299.89	1.68	59.61	29.49	14.72	0.00	20.85
Venezuela	25.89	5.63	8.34	0.32	117.79	0.35	7.23	39.28	23.39	0.00	...	5.17	13.32	11.21	167.78	0.62	25.59	42.19	29.93	0.00	9.38
Vietnam	7.44	3.84	26.52	0.14	16.36	6.28	35.00	12.36	144.56	5.75	...	3.53	42.34	4.90	23.30	11.12	123.88	13.28	184.99	2.59	2.00
Zambia	4.76	3.32	6.20	0.68	9.71	5.04	1.66	3.29	3.05	7.30	...	3.05	9.90	23.81	13.83	8.92	5.88	3.53	3.90	3.29	2.31
Zimbabwe	7.37	1.75	2.80	0.96	31.90	2.22	2.65	4.97	10.09	0.54	...	1.61	4.47	33.62	45.44	3.93	9.38	5.34	12.91	0.24	5.97

130 rows × 22 columns

10.¶

Now that the dataset is wide, I want you to answer the same question from Question 5 above: “What country produces the most kg C02 per person per year?”. Specifically, I want you to notice that the way we answer the same data analysis question changes depending on the format of the data (wide vs long). You can form your own opinion on which option you prefer - I prefer long data (and remember that many visualization libraries work best with long data too - more on that in DSCI 531). Hint: you can index the outer layer of a multi-index column using the same syntax we’ve seen previously: df['co2_emmission'], if you wanted to access an inner index, you’d have to use a tuple: df[("consumption", "Beef")]. Read more on multi-indexes (also called “hierarchical indexes”) in Chapter 9.

df["consumption"].sum(axis=1).sort_values().tail(1)

country
Finland    639.79
dtype: float64

Python Programming for Data Science

Basic Wrangling With Pandas¶

Exercises¶

1.¶

2.¶

3.¶

4.¶

5.¶

6.¶

7.¶

8.¶

9.¶

10.¶

Solutions¶

1.¶

2.¶

3.¶

4.¶

5.¶

6.¶

7.¶

8.¶

9.¶

10.¶