Appendix: More Seaborn Plotting#

Adapted from this post

The intent of this notebook is to showcase the common Seaborn plots that are useful for exploratory data analysis.

# import the required libraries
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl
# All datasets in seaborn
dataset_names = sns.get_dataset_names()
print("Datasets:", dataset_names)
Datasets: ['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']

Plots#

Here are, in no particular order, the common plot types useful for exploratory data analysis we will examine:

  • Scatter plot

  • Histogram

  • Count plot

  • Boxplot

  • Line chart

  • Pairplot

  • Jointplot

Scatter Plot#

A scatter plot shows how two numerical variables are related. One variable goes on the x-axis, the other on the y-axis, and each dot represents one observation's pair of values. If the dots rise as you move to the right, the relationship is positive. If they fall, it is negative. If there is no clear pattern, the two variables are probably unrelated, or at least not related in a simple linear way.

# load penguin data
penguins_data = pl.DataFrame(sns.load_dataset('penguins') )

# sample of the data
penguins_data.head()
shape: (5, 7)
species   island       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
str       str          f64             f64            f64                f64          str
"Adelie"  "Torgersen"  39.1            18.7           181.0              3750.0       "Male"
"Adelie"  "Torgersen"  39.5            17.4           186.0              3800.0       "Female"
"Adelie"  "Torgersen"  40.3            18.0           195.0              3250.0       "Female"
"Adelie"  "Torgersen"  null            null           null               null         null
"Adelie"  "Torgersen"  36.7            19.3           193.0              3450.0       "Female"
# plot scatter plot
sns.relplot(data = penguins_data,
            x= 'bill_length_mm',
            y= 'bill_depth_mm',
            kind= 'scatter')
<seaborn.axisgrid.FacetGrid at 0x168709cd0>
[Figure: scatter plot of bill_depth_mm against bill_length_mm]

Histogram#

A histogram is like a bar chart for numerical data. It shows how often values fall into each range of a dataset. The x-axis is divided into ranges called ‘bins’, and the height of each bar on the y-axis shows how many values fall into that bin. It helps you understand the distribution of your data: taller bars mark ranges where values are concentrated, and isolated bars far from the rest can point to outliers.

# load tips data
tips_data = pl.DataFrame(sns.load_dataset('tips'))

# sample of the data
tips_data.head()
shape: (5, 7)
total_bill  tip   sex       smoker  day    time      size
f64         f64   cat       cat     cat    cat       i64
16.99       1.01  "Female"  "No"    "Sun"  "Dinner"  2
10.34       1.66  "Male"    "No"    "Sun"  "Dinner"  3
21.01       3.5   "Male"    "No"    "Sun"  "Dinner"  3
23.68       3.31  "Male"    "No"    "Sun"  "Dinner"  2
24.59       3.61  "Female"  "No"    "Sun"  "Dinner"  4
# plot histogram plot
sns.displot(data= tips_data,
            x= 'total_bill',
            bins = 10,
            kind= 'hist')
<seaborn.axisgrid.FacetGrid at 0x1687286d0>
[Figure: histogram of total_bill with 10 bins]
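To make the idea of bins concrete, here is a minimal sketch of equal-width binning in plain Python. The helper bin_counts and the bill values are hypothetical and exist only for this illustration. Seaborn does its own binning internally and may place edges differently.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes the values are not all identical
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Made-up bill totals for illustration only.
bills = [8.0, 10.5, 12.0, 12.5, 15.0, 16.0, 18.5, 21.0, 24.0, 48.0]
edges, counts = bin_counts(bills, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * n}")
```

Changing the number of bins changes the picture: too few bins hide structure, while too many leave most bars nearly empty.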

Count Plot#

A count plot is a simple way to show how many times each category appears in a dataset. It’s like a bar chart, where the categories are listed on the x-axis and the count (frequency) of each category is shown on the y-axis. It’s useful for quickly understanding the distribution of a categorical variable: the taller the bar, the more often that category appears. It’s handy for spotting common categories or imbalances in your data.

# Using the island column of the penguins data loaded earlier.
sns.countplot(data= penguins_data, x= 'island')
<Axes: xlabel='island', ylabel='count'>
[Figure: count plot of penguins per island]
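The bar heights in a count plot are just category frequencies. As a rough sketch, the same numbers can be tallied with collections.Counter. The island labels below are made up for illustration and are not the real penguin counts.

```python
from collections import Counter

# Made-up island labels for illustration only.
islands = ["Biscoe", "Dream", "Biscoe", "Torgersen", "Dream", "Biscoe"]

counts = Counter(islands)

# Print a small text version of the bars, most common category first.
for island, n in counts.most_common():
    print(f"{island:<10} {'#' * n} ({n})")
```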

Box Plot#

A box plot is a compact way to display the distribution of numerical data and identify outliers. It shows the median (middle value), the quartiles (which divide the data into four equal parts), and any outliers. The ‘box’ represents the middle 50% of the data, with the line inside it marking the median. The ‘whiskers’ extend to the smallest and largest values that are not outliers. By default, Seaborn treats points that lie more than 1.5 times the interquartile range beyond the box as outliers and draws them individually. Box plots are helpful for comparing distributions and identifying unusual data points.

# Using the bill_length_mm and species columns of the penguins data loaded earlier.
sns.catplot(data= penguins_data,
            y= 'bill_length_mm',
            kind= 'box',
            hue = 'species')
<seaborn.axisgrid.FacetGrid at 0x1697612d0>
[Figure: box plots of bill_length_mm, one per species]
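The pieces of a box plot can be computed by hand. The sketch below uses made-up values and the common 1.5 × IQR rule for outliers. The quartile method chosen here ("inclusive") is an assumption, and a plotting library may compute quartiles slightly differently.

```python
from statistics import quantiles

# Made-up bill lengths in mm for illustration only.
values = [36.0, 37.5, 38.0, 39.0, 39.5, 40.0, 41.0, 42.5, 43.0, 58.0]

# Quartiles: the box spans q1 to q3, with a line at the median.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

inside = [v for v in values if low_fence <= v <= high_fence]
outliers = [v for v in values if v < low_fence or v > high_fence]

# Whiskers reach the most extreme values that are still inside the fences.
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, "to", q3, "median:", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```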

Line Chart#

A line chart is a type of graph that shows how data changes over time or another continuous interval. It’s useful for visualizing trends, comparing data sets, or identifying patterns over time. The x-axis typically represents the time or interval, while the y-axis represents the value being measured.

# load stock data
dowjones_data = pl.DataFrame(sns.load_dataset('dowjones'))

# sample of the data
dowjones_data.head()
shape: (5, 2)
Date                 Price
datetime[ns]         f64
1914-12-01 00:00:00  55.0
1915-01-01 00:00:00  56.55
1915-02-01 00:00:00  56.0
1915-03-01 00:00:00  58.3
1915-04-01 00:00:00  66.45
# plot line chart
sns.relplot(data= dowjones_data,
            x= 'Date',
            y= 'Price',
            kind= 'line')
<seaborn.axisgrid.FacetGrid at 0x1747045d0>
[Figure: line chart of Price against Date]

Pair Plot#

A pair plot, also known as a scatterplot matrix, is a grid of scatterplots showing the relationships between pairs of variables in a dataset. Each scatterplot in the grid represents the relationship between two numerical variables. It helps visualize the relationships and correlations between multiple variables simultaneously. The diagonal of the pair plot typically shows a histogram or kernel density plot for each variable, allowing you to see the distribution of each variable individually. Pair plots are useful for exploring multivariate relationships and identifying patterns or trends in the data.

A kernel density plot is a smoothed version of a histogram.

# Using the penguins data loaded earlier.
sns.pairplot(data=penguins_data.to_pandas(), hue="species")
<seaborn.axisgrid.PairGrid at 0x174dccbd0>
[Figure: pair plot of the numerical penguin columns, coloured by species]
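To show what "a smoothed version of a histogram" means, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The helper gaussian_kde, the sample values, and the bandwidth of 0.5 are all hypothetical choices for this illustration. Seaborn selects a bandwidth automatically and its implementation differs.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes a bell curve centred on its own value.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Made-up samples with one cluster near 2 and another near 5.
samples = [1.0, 1.2, 1.9, 2.1, 2.2, 4.8, 5.0]
density = gaussian_kde(samples, bandwidth=0.5)

# Evaluate on a grid from -2.0 to 8.0 and check the curve encloses an area near 1.
step = 0.1
grid = [i * step for i in range(-20, 81)]
area = sum(density(x) for x in grid) * step

print(f"density near the first cluster (x=2.0): {density(2.0):.3f}")
print(f"density between the clusters (x=3.5):   {density(3.5):.3f}")
print(f"approximate area under the curve:        {area:.3f}")
```

A larger bandwidth gives a smoother curve that can blur the two clusters together, while a smaller one gives a spikier curve that follows individual points.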

Jointplot#

A Seaborn jointplot combines a scatterplot and two histograms. It shows the relationship between two numerical variables by plotting their joint distribution. The central part of the jointplot displays a scatterplot of the two variables, while the marginal histograms show the distribution of each variable individually. It’s useful for visualizing correlations between variables and understanding their distributions simultaneously.

# Using the penguins data loaded earlier.
sns.jointplot(data=penguins_data,
              x="flipper_length_mm",
              y= "bill_length_mm",
              hue="species")
<seaborn.axisgrid.JointGrid at 0x1768433d0>
[Figure: joint plot of bill_length_mm against flipper_length_mm, coloured by species]

Basic Customization#

Here is a list of items we will look at:

Plot style and colour#

  • changing style

  • changing palette

  • create and use custom palette

Adding titles and labels#

  • FacetGrids (Figure-level functions) vs AxesSubplots (Axes-level functions)

  • adding a title to a facetgrid object

  • adding a title and axis labels

  • rotating x-tick labels

Plot Style#

Seaborn has five preset figure styles that change the background and axes of the plot. They are:

  • “white”: provides clean axes with a solid white background

  • “whitegrid”: adds a gray grid to the white background

  • “dark”: provides a gray background

  • “darkgrid”: provides a gray background with a white grid

  • “ticks”: similar to “white” but adds small tick marks to the x- and y-axes

If you never set a style, plots use matplotlib's own defaults (a white background with tick marks). Calling sns.set_theme() with no arguments applies “darkgrid”.

To set one of these as the global style for all plots, use the sns.set_style function:

sns.set_style("darkgrid")
sns.countplot(data= penguins_data, x= 'island')
<Axes: xlabel='island', ylabel='count'>
[Figure: count plot of penguins per island in the darkgrid style]
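If you want a style for a single figure rather than globally, one option is sns.axes_style, which returns the style's settings and can also be used as a context manager. The sketch below uses a small made-up pandas DataFrame (an assumption, so that it runs without downloading a dataset).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up island labels for illustration only.
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe", "Torgersen"]})

# A style is a dictionary of matplotlib settings that can be inspected.
darkgrid = sns.axes_style("darkgrid")
white = sns.axes_style("white")
print("darkgrid draws a grid:", darkgrid["axes.grid"])
print("white draws a grid:   ", white["axes.grid"])

# Apply "whitegrid" to this one figure only; the global style is unchanged.
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots()
    sns.countplot(data=df, x="island", ax=ax)
```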

Color#

Seaborn comes with many preset colour palettes that can be referred to by name. Palettes come in the following types:

  • Qualitative Palettes: These are suitable for categorical data where no particular ordering is implied. Examples include “deep,” “bright,” “pastel,” and “dark.”

  • Sequential Palettes: These are suitable for ordered data where you want to show variation from low to high values. Examples include “viridis,” “inferno,” “cividis,” and “magma.”

  • Diverging Palettes: These are suitable for ordered data where you want to highlight both low and high values relative to a midpoint. Examples include “coolwarm,” “RdBu,” “PuOr,” and “Spectral.”

  • Categorical Palettes: These also suit categorical data where you want a distinct colour for each category and no inherent order. “husl” and “hls” space hues evenly around the colour wheel, while “Paired” and “Set3” are ColorBrewer sets.

You can also create your own custom palette.

Qualitative Palettes#

palettes = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']

for p in palettes:
  print(p)
  sns.set_palette(p)
  sns.palplot(sns.color_palette())
deep
muted
pastel
bright
dark
colorblind
[Figures: colour swatches for the deep, muted, pastel, bright, dark, and colorblind palettes]

Sequential Palettes#

palettes = ['viridis', 'inferno', 'cividis','magma']

for p in palettes:
  print(p)
  sns.set_palette(p)
  sns.palplot(sns.color_palette())
viridis
inferno
cividis
magma
[Figures: colour swatches for the viridis, inferno, cividis, and magma palettes]

Diverging Palettes#

palettes = ['RdBu', 'PRGn', 'RdBu_r', 'PRGn_r']
# Note: appending "_r" to a palette name reverses the palette.

for p in palettes:
  print(p)
  sns.set_palette(p)
  sns.palplot(sns.color_palette())
RdBu
PRGn
RdBu_r
PRGn_r
[Figures: colour swatches for the RdBu, PRGn, RdBu_r, and PRGn_r palettes]

Categorical Palettes#

palettes = ['husl', 'hls', 'Paired', 'Set3']


for p in palettes:
  print(p)
  sns.set_palette(p)
  sns.palplot(sns.color_palette())
husl
hls
Paired
Set3
[Figures: colour swatches for the husl, hls, Paired, and Set3 palettes]

Custom Palette#

You can create your own custom palettes by passing in a list of colour names or a list of hex colour codes.

custom_palette = ['#FBB4AE', '#B3CDE3', '#CCEBC5',
                  '#DECBE4', '#FED9A6', '#FFFFCC']

Let's apply our custom palette to a chart.

sns.set_palette(custom_palette)
# To use a named palette instead, pass its name, e.g. sns.set_palette("pastel").

sns.catplot(data= penguins_data,
            y= 'bill_length_mm',
            kind= 'box',
            hue = 'species')
<seaborn.axisgrid.FacetGrid at 0x314530310>
[Figure: box plots of bill_length_mm per species, drawn with the custom palette]
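sns.set_palette changes the default for every later plot. If you only want custom colours on one chart, a possible alternative is the palette= argument of the plotting function. The sketch below uses a three-colour subset of the custom palette and a made-up pandas DataFrame (both assumptions for illustration).

```python
import pandas as pd
import seaborn as sns

# First three colours of the custom palette, one per species.
custom_palette = ["#FBB4AE", "#B3CDE3", "#CCEBC5"]

# Made-up values for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [38.0, 40.0, 48.0, 50.0, 46.0, 49.0],
})

# color_palette() converts hex codes to RGB tuples without touching
# the global default palette.
colors = sns.color_palette(custom_palette)
print(colors.as_hex())

# palette= applies the colours to this plot only.
ax = sns.boxplot(data=df, x="species", y="bill_length_mm",
                 hue="species", palette=custom_palette)
```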

FacetGrid vs AxesSubplot#

As a reminder, Seaborn has two kinds of plotting functions:

  • Figure-level functions, such as relplot, displot, and catplot, which return a FacetGrid

  • Axes-level functions, such as scatterplot and boxplot, which return an AxesSubplot

Titles and labels are customized differently for the two kinds of plot.

If you’re unsure which is which, Python's built-in “type” function will tell you.

# Here is an example to figure out the plot type:
g = sns.scatterplot(data= penguins_data,
                    x= 'bill_length_mm',
                    y= 'flipper_length_mm')

type(g)
matplotlib.axes._axes.Axes
[Figure: scatter plot of flipper_length_mm against bill_length_mm]

The output (“matplotlib.axes._axes.Axes”) tells us this scatterplot is an axes-level plot: a matplotlib Axes object, which older matplotlib versions report as AxesSubplot.
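Rather than reading the printed type by eye, the check can be written with isinstance. This is a small sketch on a made-up pandas DataFrame (an assumption for illustration), comparing an axes-level call with a figure-level one.

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 5, 8]})

ax = sns.scatterplot(data=df, x="x", y="y")                # axes-level
grid = sns.relplot(data=df, x="x", y="y", kind="scatter")  # figure-level

print("scatterplot returns Axes:  ", isinstance(ax, matplotlib.axes.Axes))
print("relplot returns FacetGrid: ", isinstance(grid, sns.FacetGrid))

# A FacetGrid wraps one or more ordinary Axes, reachable through grid.axes.
first_axes = grid.axes.flat[0]
print("grid.axes holds Axes:      ", isinstance(first_axes, matplotlib.axes.Axes))
```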

Adding a Title to FacetGrid#

g = sns.catplot(data= penguins_data,
                y= 'bill_length_mm',
                kind= 'box',
                hue = 'species')

# Add a title. The y parameter adjusts the height of the title.
g.figure.suptitle("Penguin Bill Length Box Plot", y= 1.03)
Text(0.5, 1.03, 'Penguin Bill Length Box Plot')
[Figure: box plots of bill_length_mm per species with a figure title]

Adding a Title to AxesSubplot#

g = sns.boxplot(data= penguins_data,
                y= 'bill_length_mm',
                hue = 'species')

g.set_title("Penguin Bill Length Box Plot", y= 1.03)
Text(0.5, 1.03, 'Penguin Bill Length Box Plot')
[Figure: box plots of bill_length_mm per species with an axes title]

Adding Title for FacetGrid Subplots#

g = sns.catplot(data= penguins_data,
                y= 'bill_length_mm',
                kind= 'box',
                hue = 'species',
                col= 'island')
# The "col" parameter creates one subplot per island (three here).

# Add a title. The y parameter adjusts the height of the title.
g.figure.suptitle("Penguin Bill Length Box Plot", y= 1.03)
Text(0.5, 1.03, 'Penguin Bill Length Box Plot')
[Figure: box plots of bill_length_mm per species, one panel per island, with a figure title]
# Notice that each subplot has its own title. We can change these titles.

g = sns.catplot(data= penguins_data,
                y= 'bill_length_mm',
                kind= 'box',
                hue = 'species',
                col= 'island')

g.figure.suptitle("Penguin Bill Length Box Plot", y= 1.03)

# Set the subplot titles. {col_name} is replaced by each island's name.
g.set_titles("This is {col_name} Island")
Text(0.5, 1.03, 'Penguin Bill Length Box Plot')
<seaborn.axisgrid.FacetGrid at 0x314755850>
[Figure: box plots of bill_length_mm per species, one panel per island, with custom panel titles]

Adding axis labels#

The same method works for both FacetGrid and AxesSubplot plot types.

g = sns.catplot(data= penguins_data,
                x= 'species',
                y= 'bill_length_mm',
                kind= 'box',
                hue = 'species')

g.set(xlabel = 'species',
      ylabel= "bill_length_mm")
<seaborn.axisgrid.FacetGrid at 0x3147c6590>
[Figure: box plots of bill_length_mm per species with axis labels set]
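As a sketch of the same idea on both plot types, the example below calls set(xlabel=..., ylabel=...) on an axes-level plot and, as an alternative for figure-level plots, FacetGrid.set_axis_labels. The DataFrame and the label text are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [38.0, 40.0, 48.0, 50.0, 46.0, 49.0],
})

# Axes-level plot: set() is called on the returned Axes.
ax = sns.boxplot(data=df, x="species", y="bill_length_mm")
ax.set(xlabel="Species", ylabel="Bill length (mm)")

# Figure-level plot: set_axis_labels() takes the x label, then the y label.
g = sns.catplot(data=df, x="species", y="bill_length_mm", kind="box")
g.set_axis_labels("Species", "Bill length (mm)")
```

Readable labels with units are usually worth the extra line, since column names such as bill_length_mm are written for code rather than for readers.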

Rotating x-axis/ y-axis tick labels#

Sometimes tick labels overlap, making them hard to read. You can address this by rotating the tick labels. To do this, we don’t call a method on the plot itself. Instead, after the plot is created, we call the matplotlib function ‘plt.xticks’ or ‘plt.yticks’ and set rotation. This works with both FacetGrid and AxesSubplot. Note that these functions act on the current axes only, so on a FacetGrid with several subplots they change only one of them.

g = sns.catplot(data= penguins_data,
                x= 'species',
                y= 'bill_length_mm',
                kind= 'box',
                hue = 'species')

g.set(xlabel = 'species',
      ylabel= "bill_length_mm")

plt.xticks(rotation = 90)
<seaborn.axisgrid.FacetGrid at 0x31458fb90>
([0, 1, 2],
 [Text(0, 0, 'Adelie'), Text(1, 0, 'Chinstrap'), Text(2, 0, 'Gentoo')])
[Figure: box plots of bill_length_mm per species with x tick labels rotated 90 degrees]
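When a FacetGrid has several subplots, one way to rotate the labels on all of them is to loop over g.axes and call matplotlib's tick_params on each. This is a sketch on a made-up pandas DataFrame (an assumption for illustration), not the only way to do it.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Adelie", "Gentoo"],
    "island": ["Dream", "Dream", "Biscoe", "Biscoe"],
    "bill_length_mm": [38.0, 48.0, 40.0, 47.0],
})

# One subplot per island.
g = sns.catplot(data=df, x="species", y="bill_length_mm",
                kind="box", col="island")

# g.axes holds every subplot; rotate the x tick labels on each one.
for ax in g.axes.flat:
    ax.tick_params(axis="x", labelrotation=90)
```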