Wrap-up Challenge

Wrap-up Challenge#

We’ve provided a file called starwars.csv in this folder (you should see it in the file explorer on the left). This is a tidy dataset that you’re going to practice your Polars and Seaborn skills on by answering the following questions.

Note: if you find it helpful to see similar commands on this same data using dplyr in R, they’re available here

For clarity we’re going to re-import everything you normally might when working with real data.

import polars as pl
from polars import col, when, lit
import polars.selectors as cs

import seaborn as sns
import matplotlib.pyplot as plt # for customization if needed

Inspecting data#

Load the file and print the first few rows:

# Your code here
# Solution
sw = pl.read_csv('starwars.csv')
sw.head()
shape: (5, 11)
nameheightmasshair_colorskin_coloreye_colorbirth_yearsexgenderhomeworldspecies
strf64f64strstrstrf64strstrstrstr
"Luke Skywalker"172.077.0"blond""fair""blue"19.0"male""masculine""Tatooine""Human"
"C-3PO"167.075.0null"gold""yellow"112.0"none""masculine""Tatooine""Droid"
"R2-D2"96.032.0null"white, blue""red"33.0"none""masculine""Naboo""Droid"
"Darth Vader"202.0136.0"none""white""yellow"41.9"male""masculine""Tatooine""Human"
"Leia Organa"150.049.0"brown""light""brown"19.0"female""feminine""Alderaan""Human"

How many rows are in the DataFrame total?

# Your code here
# Solution
sw.shape
(87, 11)

How many missing values are in each column?

Hint: checkout .null_count()

# Your code here
# Solution
sw.null_count()
shape: (1, 11)
nameheightmasshair_colorskin_coloreye_colorbirth_yearsexgenderhomeworldspecies
u32u32u32u32u32u32u32u32u32u32u32
06285004444104

How many unique characters are there?

Hint: for quick operations you can directly refer to a rows/columns using slicing [] syntax

# Your code here
# Solution
sw['name'].n_unique()
87

Return a list of the unique “species” in the data and sort them alphabetically

# Your code here
# Solution
sw['species'].unique().sort()
shape: (38,)
species
str
null
"Aleena"
"Besalisk"
"Cerean"
"Chagrian"
"Vulptereen"
"Wookiee"
"Xexto"
"Yoda's species"
"Zabrak"

Return the number of observations for each unique “eye-color”

Hint: checkout .value_counts()

# Your code here
# Solution
sw['eye_color'].value_counts()
shape: (15, 2)
eye_colorcount
stru32
"dark"1
"red, blue"1
"unknown"3
"gold"1
"orange"8
"blue"19
"white"1
"pink"1
"green, yellow"1
"brown"21

Wrangling Data#

Ok lets use the contexts and expressions we’ve learned to about filter and aggregate this data in different ways.

Filter rows to get character’s whose “gender” is “masculine” and “height” is above the median height of all characters.

How many rows in the result?

# Your code here
# Solution
sw.filter(
    col('gender').eq('masculine') & col('height').ge(col('height').median())
)
shape: (39, 11)
nameheightmasshair_colorskin_coloreye_colorbirth_yearsexgenderhomeworldspecies
strf64f64strstrstrf64strstrstrstr
"Darth Vader"202.0136.0"none""white""yellow"41.9"male""masculine""Tatooine""Human"
"Biggs Darklighter"183.084.0"black""light""brown"24.0"male""masculine""Tatooine""Human"
"Obi-Wan Kenobi"182.077.0"auburn, white""fair""blue-gray"57.0"male""masculine""Stewjon""Human"
"Anakin Skywalker"188.084.0"blond""fair""blue"41.9"male""masculine""Tatooine""Human"
"Wilhuff Tarkin"180.0null"auburn, grey""fair""blue"64.0"male""masculine""Eriadu""Human"
"San Hill"191.0null"none""grey""gold"null"male""masculine""Muunilinst""Muun"
"Grievous"216.0159.0"none""brown, white""green, yellow"null"male""masculine""Kalee""Kaleesh"
"Tarfful"234.0136.0"brown""brown""blue"null"male""masculine""Kashyyyk""Wookiee"
"Raymus Antilles"188.079.0"brown""light""brown"null"male""masculine""Alderaan""Human"
"Tion Medon"206.080.0"none""grey""black"null"male""masculine""Utapau""Pau'an"

Expand your filter to also exclude rows where a character’s “mass” is missing (null).

How many rows remain?

# Your code here
# Solution
sw.filter(
    col('gender').eq('masculine') & col('height').ge(col('height').median()) & col('mass').is_not_null()
)
shape: (28, 11)
nameheightmasshair_colorskin_coloreye_colorbirth_yearsexgenderhomeworldspecies
strf64f64strstrstrf64strstrstrstr
"Darth Vader"202.0136.0"none""white""yellow"41.9"male""masculine""Tatooine""Human"
"Biggs Darklighter"183.084.0"black""light""brown"24.0"male""masculine""Tatooine""Human"
"Obi-Wan Kenobi"182.077.0"auburn, white""fair""blue-gray"57.0"male""masculine""Stewjon""Human"
"Anakin Skywalker"188.084.0"blond""fair""blue"41.9"male""masculine""Tatooine""Human"
"Chewbacca"228.0112.0"brown""unknown""blue"200.0"male""masculine""Kashyyyk""Wookiee"
"Wat Tambor"193.048.0"none""green, grey""unknown"null"male""masculine""Skako""Skakoan"
"Grievous"216.0159.0"none""brown, white""green, yellow"null"male""masculine""Kalee""Kaleesh"
"Tarfful"234.0136.0"brown""brown""blue"null"male""masculine""Kashyyyk""Wookiee"
"Raymus Antilles"188.079.0"brown""light""brown"null"male""masculine""Alderaan""Human"
"Tion Medon"206.080.0"none""grey""black"null"male""masculine""Utapau""Pau'an"

Expand your filter to include the previous result or characters whose “skin_color” is ‘dark’ or ‘pale’

How many rows in this result?

# Your code here
# Solution
sw.filter(
    (col('gender').eq('masculine') & 
     col('height').ge(col('height').median()) & 
     col('mass').is_not_null()) |  
     col('skin_color').is_in(['dark', 'pale'])
)
shape: (37, 11)
nameheightmasshair_colorskin_coloreye_colorbirth_yearsexgenderhomeworldspecies
strf64f64strstrstrf64strstrstrstr
"Darth Vader"202.0136.0"none""white""yellow"41.9"male""masculine""Tatooine""Human"
"Biggs Darklighter"183.084.0"black""light""brown"24.0"male""masculine""Tatooine""Human"
"Obi-Wan Kenobi"182.077.0"auburn, white""fair""blue-gray"57.0"male""masculine""Stewjon""Human"
"Anakin Skywalker"188.084.0"blond""fair""blue"41.9"male""masculine""Tatooine""Human"
"Chewbacca"228.0112.0"brown""unknown""blue"200.0"male""masculine""Kashyyyk""Wookiee"
"Tarfful"234.0136.0"brown""brown""blue"null"male""masculine""Kashyyyk""Wookiee"
"Raymus Antilles"188.079.0"brown""light""brown"null"male""masculine""Alderaan""Human"
"Sly Moore"178.048.0"none""pale""white"nullnullnull"Umbara"null
"Tion Medon"206.080.0"none""grey""black"null"male""masculine""Utapau""Pau'an"
"Finn"nullnull"black""dark""dark"null"male""masculine"null"Human"

What is the average height and average mass of characters who are from a “homeworld” whose name starts with the letters ‘Co’?

Hint: you can “chain” a .select() onto the result of a .filter()

# Your code
sw.filter(
    col('homeworld').str.starts_with('Co')
).select(
    col('height','mass').mean()
)
shape: (1, 2)
heightmass
f64f64
175.66666771.5

Summarize the mean and standard-deviation of the “height” and “mass” of all non-Human characters, split by “sex”, and sort the result by the mean height in descending order.

Hint: you can chain a .group_by() after a .filter(), and chain .sort() at the end

# Your code here
# Solution
sw.filter(
    ~col('species').eq('Human')
).group_by('sex').agg(
    height_mean = col('height').mean(),
    height_std = col('height').std(),
    mass_mean = col('mass').mean(),
    mass_std = col('mass').std(),
).sort('height_mean', descending=True)
shape: (4, 5)
sexheight_meanheight_stdmass_meanmass_std
strf64f64f64f64
"female"179.57142916.08163753.8666673.089768
"male"176.91176546.29900776.4834.240717
"hermaphroditic"175.0null1358.0null
"none"131.249.14977169.7551.031853

Extend your summary to filter out any values of “sex” that are “hermaphroditic” or “none” (not missing/null!)

# Your code here
# Solution
sw.filter(
    ~col('species').eq('Human') & ~col('sex').is_in(['hermaphroditic', 'none'])
).group_by('sex').agg(
    height_mean = col('height').mean(),
    height_std = col('height').std(),
    mass_mean = col('mass').mean(),
    mass_std = col('mass').std(),
).sort('height_mean', descending=True)
shape: (2, 5)
sexheight_meanheight_stdmass_meanmass_std
strf64f64f64f64
"female"179.57142916.08163753.8666673.089768
"male"176.91176546.29900776.4834.240717

Add a column to the original DataFrame called “mass_lbs” that converts “mass” to lbs by multiplying by 2.2 and drop all remaining columns except: name, height, and species.

Remove any rows with null values and save the result to a new DataFrame called sw_processed

# Your code here
# Solution
sw_processed = (
    sw.with_columns(
        mass_lbs = col('mass') * 2.2
    ).select(['name','height','species','mass_lbs']).drop_nulls()
)
sw_processed
shape: (56, 4)
nameheightspeciesmass_lbs
strf64strf64
"Luke Skywalker"172.0"Human"169.4
"C-3PO"167.0"Droid"165.0
"R2-D2"96.0"Droid"70.4
"Darth Vader"202.0"Human"299.2
"Leia Organa"150.0"Human"107.8
"Shaak Ti"178.0"Togruta"125.4
"Grievous"216.0"Kaleesh"349.8
"Tarfful"234.0"Wookiee"299.2
"Raymus Antilles"188.0"Human"173.8
"Tion Medon"206.0"Pau'an"176.0

Use the new DataFrame you created sw_processed to add 2 additional columns called “height_z_species” and “mass_lbs_z_species” that are z-scored versions of height and mass_lbs by “species”

Save the DataFrame by overwriting the previous variable sw_processed

# Your code here
# Solution
zscore = lambda name: ( col(name) - col(name).mean() ) / col(name).std()

sw_processed = sw_processed.with_columns(
    height_z_species = zscore('height').over('species'),
    mass_lbs_z_species = zscore('mass_lbs').over('species'),
)

sw_processed
shape: (56, 6)
nameheightspeciesmass_lbsheight_z_speciesmass_lbs_z_species
strf64strf64f64f64
"Luke Skywalker"172.0"Human"169.4-0.71983-0.222945
"C-3PO"167.0"Droid"165.00.5191670.102877
"R2-D2"96.0"Droid"70.4-0.84605-0.739734
"Darth Vader"202.0"Human"299.21.8977352.828964
"Leia Organa"150.0"Human"107.8-2.639378-1.671308
"Shaak Ti"178.0"Togruta"125.4nullnull
"Grievous"216.0"Kaleesh"349.8nullnull
"Tarfful"234.0"Wookiee"299.20.7071070.707107
"Raymus Antilles"188.0"Human"173.80.676204-0.11949
"Tion Medon"206.0"Pau'an"176.0nullnull

Whats the shape of the result?

# Your code here
# Solution
sw_processed.shape
(56, 6)

How many unique characters are in the result?

# Your code here
sw_processed['name'].n_unique()
56

Drop all non-zscored numeric columns and once again save the result to sw_processed, overwriting your previous DataFrame

# Your code here
# Solution
sw_processed = sw_processed.drop(['height', 'mass_lbs'])
sw_processed
shape: (56, 4)
namespeciesheight_z_speciesmass_lbs_z_species
strstrf64f64
"Luke Skywalker""Human"-0.71983-0.222945
"C-3PO""Droid"0.5191670.102877
"R2-D2""Droid"-0.84605-0.739734
"Darth Vader""Human"1.8977352.828964
"Leia Organa""Human"-2.639378-1.671308
"Shaak Ti""Togruta"nullnull
"Grievous""Kaleesh"nullnull
"Tarfful""Wookiee"0.7071070.707107
"Raymus Antilles""Human"0.676204-0.11949
"Tion Medon""Pau'an"nullnull

What is Leia Organa’s height in z-scores?

# Your code here
# Solution
sw_processed.filter(
    col('name').eq('Leia Organa')
).select('name', 'height_z_species')
shape: (1, 2)
nameheight_z_species
strf64
"Leia Organa"-2.639378

Who’s heavier in z-scores, Owen Lars or Darth Vader and by how much approximately?

# Your code here
# Solution
sw_processed.filter(
    col('name').is_in(['Owen Lars', 'Darth Vader'])
).select('name', 'mass_lbs_z_species')
shape: (2, 2)
namemass_lbs_z_species
strf64
"Darth Vader"2.828964
"Owen Lars"2.001328

Add in a column to your DataFrame sw_processed called “height_category” that splits characters based upon the z-scored height.

If their height is between (or including) -2 and 2 their height_category should be in the “normal”.
If their height is < -2 their height_category should be in the “short”.
If their height is > 2 their height_category should be in the “tall”.

Take the resulting column along with the “name” and “species” columns and save them to a new DataFrame variable called sw_heights

# Your code here
# Solution
sw_heights = sw_processed.with_columns(
    height_category = when(col('height_z_species').ge(-2).and_(col('height_z_species').le(2)))
                      .then(lit('normal'))
                      .when(col('height_z_species').lt(-2))
                      .then(lit('short'))
                      .otherwise(lit('tall'))
                      
).select('name','species', 'height_category')
sw_heights
shape: (56, 3)
namespeciesheight_category
strstrstr
"Luke Skywalker""Human""normal"
"C-3PO""Droid""normal"
"R2-D2""Droid""normal"
"Darth Vader""Human""normal"
"Leia Organa""Human""short"
"Shaak Ti""Togruta""tall"
"Grievous""Kaleesh""tall"
"Tarfful""Wookiee""normal"
"Raymus Antilles""Human""normal"
"Tion Medon""Pau'an""tall"

Summarize the number of unique species by “height_category”. How many of each are there?

# Your code here
# Solution
sw_heights.group_by('height_category').agg(
    col('species').n_unique()
)
shape: (3, 2)
height_categoryspecies
stru32
"short"1
"tall"26
"normal"5

Are there any humans in the “tall” category?

Hint: try using .unique() inside of a .group_by() context. Then use a .select() context on the result that includes an expression using the .list attribute operations to check if ‘Human’ is one of the list items

# Your code here
# Solution
sw_heights.group_by('height_category').agg(
    col('species').unique()
).select(
    col('height_category'),
    col('species').list.contains('Human')
)
shape: (3, 2)
height_categoryspecies
strbool
"normal"true
"short"true
"tall"false

Visualizing Data#

Using the original (unfiltered/aggregated) DataFrame, create a bar-plot showing the mean height for each homeworld and rotate the x-labels by 90 degrees

Hint: You can use sns.barplot to get back a matplotlib axis that you can customize
Or sns.catplot with kind='bar' to get back a Seaborn FacetGrid that is customized slightly differently

# Your code here
g = sns.catplot(data=sw, x='homeworld', y='height', kind='bar')
g.set_xticklabels(rotation=90)
<seaborn.axisgrid.FacetGrid at 0x317e26f50>
../../_images/9d0dff13e1a44ac1c35ab10ef7d5318e7db4687f8d0b3ad3d806204a5bbaeb2e.png

Make a box plot of “mass” for only male and female characters who’s homeworld is Tatooine, Naboo, or Kashyyyk

Hint: you’ll have to filter the data first

# Your code here
to_plot = sw.filter(
    col('sex').is_in(['male', 'female']),
    col('homeworld').is_in(['Tatooine', 'Naboo', 'Kashyyyk'])
)
g = sns.catplot(data=to_plot, x='homeworld', y='mass', kind='box', hue='sex')
g.set_xticklabels(rotation=90)
<seaborn.axisgrid.FacetGrid at 0x3183a2190>
../../_images/a8d8d6926d818960f47feaf05cfa36945218f6b57d73c7b41bf5174dbe346a91.png

Create a facet-grid of 2 scatter plots between height and mass separately for Tatooine and Naboo

Hint: you’ll have to filter again first
Then check-out the multi-plot grid tutorial for help

# Your code here
to_plot = sw.filter(
    col('homeworld').is_in(['Tatooine', 'Naboo'])
)
g = sns.FacetGrid(to_plot, col='homeworld', col_wrap=4)
g.map(sns.scatterplot, 'height', 'mass')
<seaborn.axisgrid.FacetGrid at 0x318a2b490>
../../_images/813b97c0132cae8514bf7df2a178d17b1950f62b5e389483e81086e5002c1512.png

Update the plot to color points by sex

Hint: you can use the .add_legend() method on a FaceGrid to display the legend

# Your code here
to_plot = sw.filter(
    col('homeworld').is_in(['Tatooine', 'Naboo'])
)
g = sns.FacetGrid(to_plot, col='homeworld', hue='sex')
g.map(sns.scatterplot, 'height', 'mass')
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x319c27f10>
<seaborn.axisgrid.FacetGrid at 0x319c27f10>
../../_images/65c209cbf9a0502887ac998772110f93ff83a110671da44ce18d6f2fd0a1fb49.png

Remove the color by gender and add a regression line to each plot.
For which homeworld is there a stronger relationship?

Hint: you can use the .add_legend() method on a FaceGrid to display the legend

# Your code here
to_plot = sw.filter(
    col('homeworld').is_in(['Tatooine', 'Naboo'])
)
g = sns.FacetGrid(to_plot, col='homeworld')
g.map(sns.regplot, 'height', 'mass')
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x317b255d0>
<seaborn.axisgrid.FacetGrid at 0x317b255d0>
../../_images/1d82fd55d7467e107853abdbe6eb74a805e18ff301fa9c29b0070218296c7e00.png

Take a look at the pairwise relationships between all numeric columns

Hint: you’ll have to select the columns you want first
Then take a look at sns.pairplot()
If you run into an error, remember you can eaasily convert between a Polars and Pandas DataFrame

# Your code here
to_plot = sw.select(
    cs.numeric()
).to_pandas()

sns.pairplot(to_plot)
<seaborn.axisgrid.PairGrid at 0x319cfe090>
../../_images/8bc89d7f4bf119e91119160547a8e0dc64972c5abfe75c03746f6b485659fe0c.png