Exploratory Data Analysis Workflows

Author

Eshin Jolly

Published

January 13, 2026

In this notebook, we’ll work through a complete Exploratory Data Analysis (EDA) workflow. EDA is the process of getting to know your data before any formal analysis - understanding its structure, spotting patterns, identifying problems, and developing intuitions.

We’ll assume that you’re familiar with generally familiar with EDA from PSYC 201A, so this notebook will demonstrate some common tasks in Python combining polars and seaborn. We’ll use a dataset of characters from the Star Wars universe.

How to Use This Notebook

Note

This notebook walks through some realistic EDA workflow steps you’re likely to encounter. Follow along, run the code, and make sure you understand what’s happening and why. When we move on to building and fitting statistical models, it’ll be important to have an understanding of some key properties of your data:

What variables do you have? Numeric? Categorical? Mixed?
What’s the distribution of each? What’s the general shape of your data? the range? how common are certain values?
What’s missing? What’s the pattern of missing data? Random? Correlated with some variable of interest?
What variables share relationships? No need to fit models to see if varaible simple move together - just plot them!
What’s the data telling you? Contextualize what you see - anything particularly noteworthy? what might guide your future analysis choices?

Tip

EDA is visual detective work. You’re trying to understand what you’re working with to better inform and guide the later modeling choices you might make.

Setup

Let’s import our tools:

Code

import polars as pl
from polars import col, when, lit
import polars.selectors as cs

import seaborn as sns

Phase 1: First Contact with the Data

The first step in any EDA is simply looking at your data. What do you have?

Load and Inspect

Let’s load the Star Wars dataset. This contains information about characters from the Star Wars universe:

Code

sw = pl.read_csv("data/starwars.csv")
sw

shape: (87, 11)

name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species
str	f64	f64	str	str	str	f64	str	str	str	str
"Luke Skywalker"	172.0	77.0	"blond"	"fair"	"blue"	19.0	"male"	"masculine"	"Tatooine"	"Human"
"C-3PO"	167.0	75.0	null	"gold"	"yellow"	112.0	"none"	"masculine"	"Tatooine"	"Droid"
"R2-D2"	96.0	32.0	null	"white, blue"	"red"	33.0	"none"	"masculine"	"Naboo"	"Droid"
"Darth Vader"	202.0	136.0	"none"	"white"	"yellow"	41.9	"male"	"masculine"	"Tatooine"	"Human"
"Leia Organa"	150.0	49.0	"brown"	"light"	"brown"	19.0	"female"	"feminine"	"Alderaan"	"Human"
…	…	…	…	…	…	…	…	…	…	…
"Finn"	null	null	"black"	"dark"	"dark"	null	"male"	"masculine"	null	"Human"
"Rey"	null	null	"brown"	"light"	"hazel"	null	"female"	"feminine"	null	"Human"
"Poe Dameron"	null	null	"brown"	"light"	"brown"	null	"male"	"masculine"	null	"Human"
"BB8"	null	null	"none"	"none"	"black"	null	"none"	"masculine"	null	"Droid"
"Captain Phasma"	null	null	"none"	"none"	"unknown"	null	"female"	"feminine"	null	"Human"

Polars shows us the first and last few rows, plus the column types. Notice: - str columns contain text (names, categories) - i64 columns contain integers - f64 columns contain decimals (floating point numbers)

Let’s get the basic dimensions:

Code

print(f"Rows: {sw.height}")
print(f"Columns: {sw.width}")

Rows: 87
Columns: 11

87 characters with 11 attributes each. Let’s see all the column names:

Code

sw.columns

['name',
 'height',
 'mass',
 'hair_color',
 'skin_color',
 'eye_color',
 'birth_year',
 'sex',
 'gender',
 'homeworld',
 'species']

What do these columns mean?

Before diving into analysis, we should understand what we’re working with: - name: Character name - height: Height in centimeters - mass: Weight in kilograms - hair_color, skin_color, eye_color: Physical appearance - birth_year: Years before the Battle of Yavin (BBY) - sex: Biological sex (male, female, hermaphroditic, none) - gender: Gender identity (masculine, feminine) - homeworld: Planet of origin - species: Species (Human, Droid, Wookiee, etc.)

This context matters! “mass” in Star Wars includes robots and aliens with very different body compositions than humans.

Check for Missing Data

Missing data is everywhere in real datasets. Let’s see how much we’re dealing with:

Code

sw.null_count()

shape: (1, 11)

name	height	mass	hair_color	skin_color	eye_color	birth_year	sex	gender	homeworld	species
u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32
0	6	28	5	0	0	44	4	4	10	4

This tells us: - height is missing for 6 characters - mass is missing for 28 characters (about 1/3 of the data!) - birth_year is missing for 44 characters (over half!) - sex, gender, homeworld, species have some missing values too

Let’s calculate the percentage missing for each column. If we wrap the entire code-block in () Python won’t care about indentation, so we can lay out the code in a more reader friendly way (like R’s pipes):

Code

# Using '*' inside col() selects all columns
(
    sw
    .null_count()
    .select(col('*') / sw.height * 100)
    .transpose(
        include_header=True,
        header_name="column",
        column_names=["pct_missing"]
    )
    .sort("pct_missing", descending=True)
)

shape: (11, 2)

column	pct_missing
str	f64
"birth_year"	50.574713
"mass"	32.183908
"homeworld"	11.494253
"height"	6.896552
"hair_color"	5.747126
…	…
"gender"	4.597701
"species"	4.597701
"name"	0.0
"skin_color"	0.0
"eye_color"	0.0

Visualize Missing Data Patterns

Sometimes missing data has patterns. Let’s visualize which rows have missing values:

seaborn.heatmap

This is new function that we didn’t meet in the previous notebook. And we’ve never called sns.FacetGrid directly - so far we’ve just working with the outputs of sns.relplot, sns.catplot, etc

Don’t sweat it - just try to build your intuitions about what’s happening below. Remember that we saw .map_dataframe when trying to layer plots by passing in any plotting function that understands our data. Well if we start with an empty FacetGrid then we can use the same approach to use sns.heatmap to build it from scratch!

Code

# Create a boolean DataFrame: True where data is missing
missing_matrix = sw.select(col('*').is_null()).to_pandas()

# Initialize the facet grid first - just data no aesthetic mappings
grid = sns.FacetGrid(data=missing_matrix, aspect=.75, height=8)

# Map the seaborn heatmap function
grid.map_dataframe(sns.heatmap, cmap='Grays', yticklabels=sw['name'], cbar=False)

# Adjust aesthetics
grid.tick_params(axis='y', labelsize=6)
grid

Each row is a character, each column is a variable. Dark cells indicate missing values.

Notice that birth_year and mass have scattered missing values across many characters - it’s not concentrated in a particular group.

Your Turn

Try to combine some other polars and seaborn workflows to explore the dataset for yourself and understand what you’re working with

Code

# Your code here

Code

# Your code here

Solution

Code

# Explore the distribution of species
sw["species"].value_counts().sort("count", descending=True).head(10)

shape: (10, 2)

species	count
str	u32
"Human"	35
"Droid"	6
null	4
"Gungan"	3
"Mirialan"	2
"Kaminoan"	2
"Wookiee"	2
"Twi'lek"	2
"Zabrak"	2
"Quermian"	1

Code

# Visualize the relationship between height and sex
sns.catplot(
    data=sw.filter(col("sex").is_in(["male", "female"])),
    x="sex",
    y="height",
    kind="violin"
).set_axis_labels("Sex", "Height (cm)")

Phase 2: Understanding Individual Variables

Before looking at relationships, understand each variable on its own.

Numeric Variables

We have three numeric variables: height, mass, and birth_year.

Let’s get summary statistics:

Code

# cs.numeric() is a polars selector that gets all numeric columns
sw.select(cs.numeric()).describe()

shape: (9, 4)

statistic	height	mass	birth_year
str	f64	f64	f64
"count"	81.0	59.0	43.0
"null_count"	6.0	28.0	44.0
"mean"	174.604938	97.311864	87.565116
"std"	34.774157	169.457163	154.691439
"min"	66.0	15.0	8.0
"25%"	167.0	56.2	37.0
"50%"	180.0	79.0	52.0
"75%"	191.0	85.0	72.0
"max"	264.0	1358.0	896.0

Some observations: - height: Ranges from 66cm to 264cm. Mean is 175cm (about 5’9”). - mass: Ranges from 15kg to 1,358kg! That max is suspicious… - birth_year: Ranges from 8 to 896 BBY. That’s a huge range.

Let’s visualize these distributions:

Code

sns.displot(
    data=sw,
    x="height",
    kind="hist",
    bins=20
).set_axis_labels("Height (cm)", "Count")

Height looks roughly normal, maybe slightly right-skewed. Most characters are between 150-200cm.

Code

sns.displot(
    data=sw,
    x="mass",
    kind="hist",
    bins=30
).set_axis_labels("Mass (kg)", "Count")

Whoa! The mass distribution is extremely right-skewed. Most characters cluster at low values, but there’s an extreme outlier pulling the scale.

Who is that outlier?

Code

sw.filter(col("mass") > 500).select("name", "mass", "species")

shape: (1, 3)

name	mass	species
str	f64	str
"Jabba Desilijic Tiure"	1358.0	"Hutt"

Jabba the Hutt weighs 1,358 kg! That’s not an error - Hutts are massive slug-like aliens.

This is why EDA matters: an outlier like this would massively distort any mean or correlation we calculate.

Let’s look at mass excluding extreme outliers to see the main distribution:

Code

sns.displot(
    data=sw.filter(col("mass") < 200),
    x="mass",
    kind="hist",
    bins=20
).set_axis_labels("Mass (kg)", "Count")

Now we can see the actual distribution. Most characters are between 50-100kg, with a cluster of lighter characters (possibly droids or smaller species).

Categorical Variables

For categorical variables, we want to know: What are the categories and how frequent is each?

Code

sw["species"].value_counts().sort("count", descending=True).head(10)

shape: (10, 2)

species	count
str	u32
"Human"	35
"Droid"	6
null	4
"Gungan"	3
"Mirialan"	2
"Twi'lek"	2
"Wookiee"	2
"Zabrak"	2
"Kaminoan"	2
"Pau'an"	1

Humans dominate the dataset (35 characters), followed by Droids (6). Most species appear only once.

Code

sw["homeworld"].value_counts().sort("count", descending=True).head(10)

shape: (10, 2)

homeworld	count
str	u32
"Naboo"	11
null	10
"Tatooine"	10
"Kamino"	3
"Coruscant"	3
"Alderaan"	3
"Ryloth"	2
"Kashyyyk"	2
"Corellia"	2
"Mirial"	2

Naboo and Tatooine are the most common homeworlds, which makes sense - many main characters come from these planets.

Code

sns.catplot(
    data=sw,
    y="sex",
    kind="count",
    order=sw["sex"].value_counts().sort("count", descending=True)["sex"].to_list()
).set_axis_labels("Count", "Sex")

The dataset is heavily male-dominated, reflecting the original Star Wars films. “none” likely refers to droids.

Reflection: What does this imbalance mean for any analysis we might do comparing sexes? We have ~60 males but only ~16 females.

Your Turn

Try creating a few plots that help you understand the other variables in the dataset

Code

# Your code here

Code

# Your code here

Code

# Your code here

Solution

Code

# Eye color distribution
sns.catplot(
    data=sw,
    y="eye_color",
    kind="count",
    order=sw["eye_color"].value_counts().sort("count", descending=True)["eye_color"].to_list(),
    height=6
).set_axis_labels("Count", "Eye Color")

Code

# Gender breakdown
sw["gender"].value_counts().sort("count", descending=True)

shape: (3, 2)

gender	count
str	u32
"masculine"	66
"feminine"	17
null	4

Code

# Birth year distribution (excluding extreme values)
sns.displot(
    data=sw.filter(col("birth_year").is_not_null() & (col("birth_year") < 200)),
    x="birth_year",
    kind="hist",
    bins=15
).set_axis_labels("Birth Year (BBY)", "Count")

Phase 3: Exploring Relationships

Now let’s look at how variables relate to each other.

Numeric vs Numeric: Scatter Plots

The classic way to explore relationships between two numeric variables is a scatter plot.

Code

sns.relplot(
    data=sw,
    x="height",
    y="mass"
).set_axis_labels("Height (cm)", "Mass (kg)")

There’s a general positive relationship (taller characters tend to be heavier), but Jabba is way off in the corner distorting our view.

Let’s exclude extreme mass values and add a regression line:

Code

sns.lmplot(
    data=sw.filter(col("mass") < 200),
    x="height",
    y="mass"
).set_axis_labels("Height (cm)", "Mass (kg)")

Now the relationship is clearer. But wait - are humans and droids following the same pattern?

Code

# Focus on the most common species
common_species = sw.filter(
    col("species").is_in(["Human", "Droid", "Gungan", "Wookiee"])
)

sns.lmplot(
    data=common_species,
    x="height",
    y="mass",
    hue="species",
    height=5,
    ci=None,
).set_axis_labels("Height (cm)", "Mass (kg)")

Numeric vs Categorical: Comparing Groups

How does height differ between male and female characters?

Code

# Filter to male/female only for cleaner comparison
sw_mf = sw.filter(col("sex").is_in(["male", "female"]))

sns.catplot(
    data=sw_mf,
    x="sex",
    y="height",
    kind="box"
).set_axis_labels("Sex", "Height (cm)")

Male characters tend to be taller, with more variability (wider box, more outliers).

Let’s see the individual data points too:

Code

sns.catplot(
    data=sw_mf,
    x="sex",
    y="height",
    kind="swarm"
).set_axis_labels("Sex", "Height (cm)")

With only ~16 female characters, we should be cautious about drawing strong conclusions. The samples are very unequal.

Faceting: Breaking Down by Multiple Categories

One of seaborn’s strengths is easily splitting visualizations by category. Let’s look at how mass varies by sex across different homeworlds:

Code

# Focus on homeworlds with enough characters
top_homeworlds = sw["homeworld"].value_counts().filter(col("count") >= 3)["homeworld"].to_list()

sw_subset = sw.filter(
    col("homeworld").is_in(top_homeworlds) &
    col("sex").is_in(["male", "female"]) &
    col("mass").is_not_null()
)

sns.catplot(
    data=sw_subset,
    x="sex",
    y="mass",
    col="homeworld",
    kind="strip",
    col_wrap=3,
    height=3
).set_axis_labels("Sex", "Mass (kg)")

We can see a few patterns: - Most homeworlds have more male characters - Tatooine has characters spanning a wide mass range - Naboo characters tend to be lighter

Your Turn

Why might Naboo characters tend to be lighter? (Hint: think about which characters are from Naboo) Or explore some other relationships that you are interested

Code

# Your code here

Code

# Your code here

Solution

Code

# Let's see who is from Naboo
sw.filter(col("homeworld") == "Naboo").select("name", "species", "sex", "mass")

shape: (11, 4)

name	species	sex	mass
str	str	str	f64
"R2-D2"	"Droid"	"none"	32.0
"Palpatine"	"Human"	"male"	75.0
"Padmé Amidala"	"Human"	"female"	45.0
"Jar Jar Binks"	"Gungan"	"male"	66.0
"Roos Tarpals"	"Gungan"	"male"	82.0
…	…	…	…
"Ric Olié"	"Human"	"male"	null
"Quarsh Panaka"	"Human"	"male"	null
"Gregar Typho"	null	null	85.0
"Cordé"	null	null	null
"Dormé"	"Human"	"female"	null

Naboo characters include Padmé Amidala and other human politicians/royalty (typically lighter build), plus Jar Jar Binks and other Gungans. The dataset doesn’t include many heavy Naboo characters.

Code

# Let's explore another relationship: species and eye color
sw.filter(col("species").is_in(["Human", "Droid"])).group_by(
    ["species", "eye_color"]
).agg(
    count=col("name").count()
).sort(["species", "count"], descending=[False, True])

shape: (11, 3)

species	eye_color	count
str	str	u32
"Droid"	"red"	3
"Droid"	"yellow"	1
"Droid"	"red, blue"	1
"Droid"	"black"	1
"Human"	"brown"	16
…	…	…
"Human"	"yellow"	2
"Human"	"hazel"	2
"Human"	"blue-gray"	1
"Human"	"dark"	1
"Human"	"unknown"	1

Phase 4: Asking Questions

Question 1: Which species is the tallest on average?

Code

species_height = sw.group_by("species").agg(
    mean_height=col("height").mean(),
    count=col("height").count()
).filter(
    col("count") >= 2  # Only species with multiple characters
).sort("mean_height", descending=True)

species_height.head(10)

shape: (9, 3)

species	mean_height	count
str	f64	u32
"Wookiee"	231.0	2
"Kaminoan"	221.0	2
"Gungan"	208.666667	3
"Twi'lek"	179.0	2
"Human"	178.0	30
null	175.0	4
"Zabrak"	173.0	2
"Mirialan"	168.0	2
"Droid"	131.2	5

Code

sns.catplot(
    data=species_height.head(10),
    y="species",
    x="mean_height",
    kind="bar",
    height=5,
    aspect=1.2
).set_axis_labels("Mean Height (cm)", "Species")

Kaminoans are the tallest species on average, followed by Wookiees. But note we filtered to species with at least 2 characters - with only 1-2 members, “averages” don’t mean much.

Question 2: Is there a relationship between height and birth year?

Do older characters tend to be shorter (like in real human populations)?

Code

# Filter out extreme birth years and missing values
sw_age = sw.filter(
    col("birth_year").is_not_null() &
    (col("birth_year") < 200)  # Exclude Yoda and others
)

sns.lmplot(
    data=sw_age,
    x="birth_year",
    y="height",
    height=5
).set_axis_labels("Birth Year (BBY)", "Height (cm)")

There’s a very weak positive relationship - older characters (higher BBY) might be slightly shorter. But the relationship is weak and could easily be noise.

Let’s check humans only:

Code

sw_age_human = sw_age.filter(col("species") == "Human")
(
    sns.lmplot(
        data=sw_age_human,
        x="birth_year",
        y="height",
        height=5
    )
    .set_axis_labels("Birth Year (BBY)", "Height (cm)")
    .figure.suptitle("Humans Only", y=1.02, x=.55)
)

Text(0.55, 1.02, 'Humans Only')

Question 3: What’s the most common eye color for each species?

Code

# For species with at least 3 characters
common_species_list = sw["species"].value_counts().filter(col("count") >= 3)["species"].to_list()

eye_by_species = sw.filter(
    col("species").is_in(common_species_list)
).group_by(["species", "eye_color"]).agg(
    count=col("name").count()
).sort(["species", "count"], descending=[False, True])

# Get the most common eye color for each species
eye_by_species.group_by('species', 'eye_color').len().sort(
    ['species', 'len'], descending=[False, True]).group_by('species').first()

shape: (3, 3)

species	eye_color	len
str	str	u32
"Droid"	"yellow"	1
"Gungan"	"orange"	1
"Human"	"blue-gray"	1

Droids typically have sensor arrays (typically yellow)
Humans have diverse eye colors (brown/blue/yellow/hazel)
Gungans have orange eyes

Phase 5: Creating Derived Variables

Sometimes you need to create new variables to answer your questions.

BMI (Body Mass Index)

Let’s calculate BMI for the humanoid characters:

\[BMI = \frac{mass}{height^2} \times 10000\]

(The 10000 converts cm to m)

Code

sw_bmi = sw.filter(
    col("mass").is_not_null() &
    col("height").is_not_null() &
    (col("mass") < 500)  # Exclude Jabba
).with_columns(
    bmi=(col("mass") / (col("height") ** 2) * 10000).round(1)
)

sw_bmi.select("name", "height", "mass", "species", "bmi").head(10)

shape: (10, 5)

name	height	mass	species	bmi
str	f64	f64	str	f64
"Luke Skywalker"	172.0	77.0	"Human"	26.0
"C-3PO"	167.0	75.0	"Droid"	26.9
"R2-D2"	96.0	32.0	"Droid"	34.7
"Darth Vader"	202.0	136.0	"Human"	33.3
"Leia Organa"	150.0	49.0	"Human"	21.8
"Owen Lars"	178.0	120.0	"Human"	37.9
"Beru Whitesun Lars"	165.0	75.0	"Human"	27.5
"R5-D4"	97.0	32.0	"Droid"	34.0
"Biggs Darklighter"	183.0	84.0	"Human"	25.1
"Obi-Wan Kenobi"	182.0	77.0	"Human"	23.2

Code

sns.displot(
    data=sw_bmi,
    x="bmi",
    kind="hist",
    bins=20
).set_axis_labels("BMI", "Count")

Most characters have BMIs in the 20-30 range, which would be “normal” to “overweight” for humans. But BMI was designed for humans - it doesn’t really make sense for Wookiees or droids!

Code

# Who has the highest BMI?
sw_bmi.sort("bmi", descending=True).select("name", "species", "bmi").head(5)

shape: (5, 3)

name	species	bmi
str	str	f64
"Dud Bolt"	"Vulptereen"	50.9
"Yoda"	"Yoda's species"	39.0
"Owen Lars"	"Human"	37.9
"IG-88"	"Droid"	35.0
"R2-D2"	"Droid"	34.7

Code

# Who has the lowest BMI?
sw_bmi.sort("bmi").select("name", "species", "bmi").head(5)

shape: (5, 3)

name	species	bmi
str	str	f64
"Wat Tambor"	"Skakoan"	12.9
"Padmé Amidala"	"Human"	13.1
"Adi Gallia"	"Tholothian"	14.8
"Sly Moore"	null	15.1
"Roos Tarpals"	"Gungan"	16.3

The lowest BMI belongs to droids (makes sense - they’re light for their size).

Creating Categories from Numeric Variables

Let’s categorize characters by height:

Code

sw_height_cat = sw.with_columns(
    height_category=when(col("height") < 100).then(lit("short"))
        .when(col("height") < 180).then(lit("medium"))
        .when(col("height") < 220).then(lit("tall"))
        .otherwise(lit("very tall"))
)

sw_height_cat["height_category"].value_counts().sort("count", descending=True)

shape: (4, 2)

height_category	count
str	u32
"tall"	39
"medium"	30
"very tall"	11
"short"	7

Code

# Visualize mass by height category
sns.catplot(
    data=sw_height_cat.filter(col("mass") < 200),
    x="height_category",
    y="mass",
    kind="box",
    order=["short", "medium", "tall", "very tall"]
).set_axis_labels("Height Category", "Mass (kg)")

As expected, taller categories have higher mass on average, but there’s overlap between groups.

Your Turn

Try playing around to create some variables you are interested in

Code

# Your code here

Code

# Your code here

Solution

Code

# Create a "power ratio" - mass per unit height (density proxy)
sw_power = sw.filter(
    col("mass").is_not_null() & col("height").is_not_null()
).with_columns(
    power_ratio=(col("mass") / col("height")).round(2)
)

# Who has the highest power ratio?
sw_power.sort("power_ratio", descending=True).select("name", "species", "mass", "height", "power_ratio").head(10)

shape: (10, 5)

name	species	mass	height	power_ratio
str	str	f64	f64	f64
"Jabba Desilijic Tiure"	"Hutt"	1358.0	175.0	7.76
"Grievous"	"Kaleesh"	159.0	216.0	0.74
"IG-88"	"Droid"	140.0	200.0	0.7
"Darth Vader"	"Human"	136.0	202.0	0.67
"Owen Lars"	"Human"	120.0	178.0	0.67
"Jek Tono Porkins"	null	110.0	180.0	0.61
"Bossk"	"Trandoshan"	113.0	190.0	0.59
"Tarfful"	"Wookiee"	136.0	234.0	0.58
"Dexter Jettster"	"Besalisk"	102.0	198.0	0.52
"Chewbacca"	"Wookiee"	112.0	228.0	0.49

Code

# Create age categories based on birth_year
sw_age_cat = sw.filter(col("birth_year").is_not_null()).with_columns(
    age_category=when(col("birth_year") < 30).then(lit("young"))
        .when(col("birth_year") < 60).then(lit("middle-aged"))
        .when(col("birth_year") < 100).then(lit("older"))
        .otherwise(lit("ancient"))
)

sw_age_cat["age_category"].value_counts().sort("count", descending=True)

shape: (4, 2)

age_category	count
str	u32
"middle-aged"	19
"older"	11
"young"	8
"ancient"	5

Phase 6: Documenting Your Findings

Good EDA isn’t just about making plots - it’s about learning something and communicating it.

Here’s what we learned about the Star Wars dataset:

Summary of Findings

Data Quality: - 87 characters, 11 variables - Significant missing data: birth_year (50%), mass (32%), height (7%) - One extreme outlier: Jabba the Hutt (mass = 1,358 kg)

Variable Distributions: - Height is roughly normal, centered around 175 cm - Mass is heavily right-skewed due to outliers - Dataset is dominated by male human characters

Key Relationships: - Height and mass are positively correlated (taller = heavier) - This relationship varies by species (especially for droids) - No clear relationship between age and height within humans

Caveats: - Small sample sizes for most species (only Humans have n>10) - Gender imbalance limits sex comparisons - BMI and similar metrics designed for humans don’t generalize well

Your Turn

Use what you’ve learned to answer these questions. Create as many new cells as you need below each question.

1. Which homeworld has the greatest diversity of species?

Hint: group by homeworld and count unique species

Code

# Your code here (create more cells as needed)

Solution

Code

sw.filter(
    col("homeworld").is_not_null() & col("species").is_not_null()
).group_by("homeworld").agg(
    n_species=col("species").n_unique(),
    species_list=col("species").unique()
).sort("n_species", descending=True).head(10)

shape: (10, 3)

homeworld	n_species	species_list
str	u32	list[str]
"Naboo"	3	["Droid", "Human", "Gungan"]
"Tatooine"	2	["Human", "Droid"]
"Kamino"	2	["Human", "Kaminoan"]
"Coruscant"	2	["Human", "Tholothian"]
"Cerea"	1	["Cerean"]
"Vulpter"	1	["Vulptereen"]
"Trandosha"	1	["Trandoshan"]
"Stewjon"	1	["Human"]
"Dathomir"	1	["Zabrak"]
"Troiken"	1	["Xexto"]

Naboo has the greatest diversity with 3 different species (Human, Gungan, and Droid).

2. Create a visualization comparing the height distributions of Humans vs Droids

Hint: use displot with hue

Code

# Your code here (create more cells as needed)

Solution

Code

sns.displot(
    data=sw.filter(col("species").is_in(["Human", "Droid"])),
    x="height",
    hue="species",
    kind="hist",
    bins=15,
    alpha=0.6
).set_axis_labels("Height (cm)", "Count")

Humans have a wider height distribution centered around 175-180cm, while Droids are more variable with some very short (R2-D2) and some tall (IG-88) units.

3. Who is the heaviest character from each homeworld?

Hint: sort by mass within each homeworld group, then take the first

Code

# Your code here (create more cells as needed)

Solution

Code

sw.filter(
    col("homeworld").is_not_null() & col("mass").is_not_null()
).sort("mass", descending=True).group_by("homeworld").first().select(
    "homeworld", "name", "species", "mass"
).sort("mass", descending=True)

shape: (39, 4)

homeworld	name	species	mass
str	str	str	f64
"Nal Hutta"	"Jabba Desilijic Tiure"	"Hutt"	1358.0
"Kalee"	"Grievous"	"Kaleesh"	159.0
"Tatooine"	"Darth Vader"	"Human"	136.0
"Kashyyyk"	"Tarfful"	"Wookiee"	136.0
"Trandosha"	"Bossk"	"Trandoshan"	113.0
…	…	…	…
"Umbara"	"Sly Moore"	null	48.0
"Vulpter"	"Dud Bolt"	"Vulptereen"	45.0
"Malastare"	"Sebulba"	"Dug"	40.0
"Endor"	"Wicket Systri Warrick"	"Ewok"	20.0
"Aleen Minor"	"Ratts Tyerel"	"Aleena"	15.0

Jabba the Hutt dominates from Nal Hutta (1,358 kg), followed by Grievous from Kalee (159 kg) and IG-88 (140 kg, unknown homeworld).

4. Is there a relationship between hair color and height for human characters?

Hint: filter to humans, then create a boxplot of height by hair color

Code

# Your code here (create more cells as needed)

Solution

Code

# Filter to humans with non-null hair color and height
humans = sw.filter(
    (col("species") == "Human") &
    col("hair_color").is_not_null() &
    col("height").is_not_null()
)

# See what hair colors we have
humans["hair_color"].value_counts().sort("count", descending=True)

shape: (10, 2)

hair_color	count
str	u32
"brown"	10
"black"	7
"none"	3
"blond"	3
"white"	2
"grey"	1
"auburn, grey"	1
"brown, grey"	1
"auburn"	1
"auburn, white"	1

Code

sns.catplot(
    data=humans,
    x="hair_color",
    y="height",
    kind="box",
    height=5,
    aspect=1.5,
    order=humans["hair_color"].value_counts().sort("count", descending=True)["hair_color"].to_list()
).set_axis_labels("Hair Color", "Height (cm)")

Theres no clear relationship between hair color and height. Brown-haired humans show the widest range (most data points), while other colors have too few observations to draw conclusions.