Tidy-data analysis with polars#

IMPORTANT: Install new libraries
Before you use this notebook you need to install some new libraries into your class environment:

  1. Open a new Terminal

  2. conda activate 201b

  3. pip install polars pyarrow itables

Welcome to the absolute beginner’s guide to Polars!

Polars is a very user-friendly DataFrame library for working with structured data in Python (with implementations for R and other languages as well).

This notebook will introduce you to Polars fundamentals and how to manipulate tidy data polars-style.

Quick note on alternative libraries

As you learn more Python and grow your skills, you’ll come across or hear about Pandas, probably the most popular library for working with DataFrames in Python. Many statistics courses will have you learn Pandas instead, given how long it’s been around and the wealth of materials available for learning it.

Unfortunately, Pandas is “showing its age”: it’s gone through so many changes and updates that frankly even Eshin finds it difficult to keep up with the “right way” to do things in Pandas. And because Pandas meets so many different needs for so many different analysts (e.g. economics, finance, psych), it can get in the way of learning and thinking for the types of analyses we in Psych/Neuro/Cogsci are likely to perform on DataFrames.

Polars by comparison is quite a bit newer, but contains all the functionality you’ll need, while being much easier to wrap your head around. This will be a learning experience for both you and your instructors - but we strongly believe the alternatives would be unnecessarily challenging on your statistics journey.

So let’s take a look at how to “think in polars.”

How to use this notebook#

This notebook is designed for you to work through at your own pace. When you’re finished, you should save and commit your changes and push them to your GitHub Classroom repository.

As you go through this notebook, you should regularly refer to the polars documentation to look things up and get general help. Try experimenting by creating new code cells and playing around with the demonstrated functionality.

Remember to use ? or help() from within this notebook to look up how functionality works.
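
For example (a minimal sketch; any polars function or method works the same way):

import polars as pl

help(pl.read_csv)    # standard Python help
# pl.read_csv?       # Jupyter-style help: uncomment and run in a notebook cell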

How to import Polars#

We can make polars available by using the import statement. It’s convention to import polars in the following way:

import polars as pl

Why use Polars?#

So far we’ve made most use of Python lists and NumPy arrays. But in practice you’re probably working with some kind of structured data, i.e. spreadsheet-style data with columns and rows.

Figure 1

In Polars we call this a DataFrame, a 2d table with rows and columns of different types of data (e.g. strings, numbers, etc).

Figure 1

Each column of a DataFrame contains the same type of data. Let’s look at an example by loading a file with the pl.read_csv() function.

This will return a DataFrame we can check out:

df = pl.read_csv('example.csv')
df
shape: (3, 3)
NameAgeSex
stri64str
"Braund, Mr. Owen Harris"22"male"
"Allen, Mr. William Henry"35"male"
"Bonnell, Miss. Elizabeth"58"female"

Notice how Polars tells us the type of each column below its name.
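
Besides reading files, you can also build a DataFrame directly from a Python dictionary using pl.DataFrame() (we’ll use this later in the notebook). A minimal sketch, reusing values from the file above:

# Keys become column names; values become the column data
tiny_df = pl.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry'],
    'Age': [22, 35],
    'Sex': ['male', 'male'],
})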

DataFrame fundamentals#

We can get basic information about a DataFrame by accessing its attributes using . syntax without () at the end:

df.shape
(3, 3)
df.height
3
df.width
3
df.columns
['Name', 'Age', 'Sex']

DataFrames have various methods that we can use with . syntax with a () at the end.

Remember that methods in Python are just functions that “belong” to some object. In this case these methods “belong” to a DataFrame object and can take arguments that operate on it.

Some might be familiar from R or other languages:

.head() gives us the first few rows. Since we only have 3 rows, we can pass an argument to the method to ask for just the first 2:

df.head(2)
shape: (2, 3)
NameAgeSex
stri64str
"Braund, Mr. Owen Harris"22"male"
"Allen, Mr. William Henry"35"male"

And .tail() is the opposite

df.tail(2) # last 2 rows
shape: (2, 3)
NameAgeSex
stri64str
"Allen, Mr. William Henry"35"male"
"Bonnell, Miss. Elizabeth"58"female"

We can use .glimpse() to see a transposed view of a DataFrame.
This can sometimes make it easier to read: the column names are listed as rows and the values as columns:

df.glimpse()
Rows: 3
Columns: 3
$ Name <str> 'Braund, Mr. Owen Harris', 'Allen, Mr. William Henry', 'Bonnell, Miss. Elizabeth'
$ Age  <i64> 22, 35, 58
$ Sex  <str> 'male', 'male', 'female'

And .describe() to get some quick summary stats:

df.describe()
shape: (9, 4)
statisticNameAgeSex
strstrf64str
"count""3"3.0"3"
"null_count""0"0.0"0"
"mean"null38.333333null
"std"null18.230012null
"min""Allen, Mr. William Henry"22.0"female"
"25%"null35.0null
"50%"null35.0null
"75%"null58.0null
"max""Braund, Mr. Owen Harris"58.0"male"

There are many additional methods to calculate statistics as well. But we’ll revisit these later:

df.mean()
shape: (1, 3)
NameAgeSex
strf64str
null38.333333null
df.min()
shape: (1, 3)
NameAgeSex
stri64str
"Allen, Mr. William Henry"22"female"

DataFrames also have a .sample() method that allows you to resample rows from the DataFrame with or without replacement.

You can tell Polars to sample all rows without replacement, aka permuting

df.sample(fraction=1, shuffle=True, with_replacement=False)
shape: (3, 3)
NameAgeSex
stri64str
"Braund, Mr. Owen Harris"22"male"
"Bonnell, Miss. Elizabeth"58"female"
"Allen, Mr. William Henry"35"male"

Or resample with replacement, aka bootstrapping

df.sample(fraction=1, shuffle=True, with_replacement=True)
shape: (3, 3)
NameAgeSex
stri64str
"Allen, Mr. William Henry"35"male"
"Allen, Mr. William Henry"35"male"
"Allen, Mr. William Henry"35"male"

We won’t cover it more in this notebook, but think about how you might use .sample() in the future to try out some of the resampling methods we’ve covered in class/HW…
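
For example, here’s a minimal sketch (not part of the exercises) of how .sample() could be used for a simple bootstrap of the mean Age in this small DataFrame:

# Resample the rows with replacement many times and record the mean Age each time
boot_means = []
for _ in range(1000):
    resampled = df.sample(fraction=1, with_replacement=True)
    boot_means.append(resampled['Age'].mean())

# boot_means now holds 1000 bootstrapped means you could summarize or plot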

Indexing a DataFrame (for simple stuff only!)#

Because a DataFrame is a 2d table, we can use the same indexing and slicing syntax you learned with numpy to directly access values.

Remember these are 0-indexed: the first row/col is at position 0, not position 1

If this is our DataFrame:

df
shape: (3, 3)
NameAgeSex
stri64str
"Braund, Mr. Owen Harris"22"male"
"Allen, Mr. William Henry"35"male"
"Bonnell, Miss. Elizabeth"58"female"

We can slice it like this:

# 0 row index = 1st row
# 1 col index = 2nd col (age)
df[0, 1]
22

And of course we can slice using start:stop:step, which always stops before (excludes) the end value we slice to

# 0:2 slice = rows up to, but not including 2 - just 0, 1
# 0 col index = 1st col (name)
df[0:2,0]
shape: (2,)
Name
str
"Braund, Mr. Owen Harris"
"Allen, Mr. William Henry"

We can also use slicing syntax to quickly refer to columns by name:

# All rows in column 'Name'
df['Name']
shape: (3,)
Name
str
"Braund, Mr. Owen Harris"
"Allen, Mr. William Henry"
"Bonnell, Miss. Elizabeth"

Which is equivalent to

# Explicitly slice 'all' rows
# 'Name' = just the Name column
df[:, 'Name']
shape: (3,)
Name
str
"Braund, Mr. Owen Harris"
"Allen, Mr. William Henry"
"Bonnell, Miss. Elizabeth"
# 0:2 slice = rows up to, but not including 2
# 'Name' = just the Name column
df[0:2, 'Name']
shape: (2,)
Name
str
"Braund, Mr. Owen Harris"
"Allen, Mr. William Henry"

Note: While you can access values this way, what makes Polars powerful is that it offers a much richer “language” or set of “patterns” for working with DataFrames, like dplyr’s verbs in R’s Tidyverse.

While it doesn’t map one-to-one, Polars offers a consistent and intuitive way of working with DataFrames that we’ll teach you in this notebook. Once the fundamentals “click” for you, you’ll be a data manipulating ninja!

Thinking in Polars: Contexts & Expressions#

To understand how to “think” in polars, we need to understand 2 fundamental concepts: contexts and expressions

A context in Polars is how you choose what data you want to operate on. There are only a few you’ll use regularly:

Contexts#

df.select() - to subset columns#

Figure 1

df.with_columns() - to add new columns#

Figure 1

df.filter() - to subset rows#

Figure 1

df.group_by().agg() - to summarize by group#

Figure 1

Expressions#

An expression is any computation we do inside of a context.

To build an expression we first use a selector to choose one or more columns.

The most common selector you’ll use is pl.col() to select one or more columns by name.

Then we use method-chaining, . syntax, to perform operations on the selected columns.

Let’s see a simple example.

Starting simple#

Let’s say we have data from some experiment that contains 3 participants, each of whom made 5 judgements about some stimuli.

We can use the pl.read_csv() function to load a file and get back a polars DataFrame:

df = pl.read_csv('example_2.csv')
df
shape: (15, 3)
participantaccuracyrt
i64i64f64
144261.28435
147728.208489
164801.889016
167713.555026
167362.682105
370439.524973
38855.11928
388644.801272
312571.800553
358715.224208

We can verify there are 15 rows and 3 columns:

df.shape
(15, 3)

Selecting columns: .select()#

The .select context is what you’ll use most often. It lets you apply expressions only to the specific columns you select.

Figure 1

Here’s a simple example. Let’s say we want to calculate the average of the accuracy column.

How would we start?

First, we need to think about our context.
Since we only want the “accuracy” column we can use .select().

df.select(                     # <- this is our context

)

Second, we need to create an expression that means: “use the accuracy column, and calculate its mean”.

We can create an expression by combining the selector pl.col() with the operation .mean() using . syntax, i.e. method-chaining.

df.select(                     # <- this is our context
  pl.col('accuracy').mean()   # <- this is an expression, inside this context
)

Let’s try this now:

df.select(                   # <- this is start of the context "select"
    pl.col('accuracy').mean()  # <- this is an expression, inside the context
)                            # <- end of the context "select"
shape: (1, 1)
accuracy
f64
56.066667

Note: Avoiding confusion

Don’t confuse df.select which is a context with a selector like pl.col().

While they might sound similar, we can use any selector inside any context. And we’ll meet a few more soon

Your turn#

How would you build an expression to calculate the “median” Reaction Time?

df.select(
    # put your expression here!
)
shape: (0, 0)

Aside: Making our lives a little easier with direct imports#

Typing pl.col over and over can get annoying since it’s such a common selector. So we can save a few keystrokes using a mechanism in Python called direct importing.

We’ve seen this in previous notebooks:

from scipy.stats import bootstrap # directly import bootstrap function

Let’s do that with col:

from polars import col

Now we can use col in place of pl.col

Expressing multiple things#

We can add as many expressions inside a context as we want. We just need to separate them with a ,.

The result of each expression will be saved to a separate column.

We’ll use the col() selector again to perform two different operations: n_unique() and mean() on two different columns:

df.select(                          # <- this is the context
    col('participant').n_unique(),  # <- this is an expression
    col('accuracy').mean()         # <- this is another expression
)                                   # <- this is the end of the context
shape: (1, 2)
participantaccuracy
u32f64
356.066667

Your turn: how would you express the following statement in polars?

Select only the participant and accuracy columns.
For participant, calculate the number of values, i.e. the count
For accuracy, calculate its standard deviation

# Your code here

# 1) Use the .select() context
# 2) Use col() to create 2 expressions separated by a comma

Repeating the same expression across many columns#

Sometimes we might find ourselves repeating the same expression for different columns.

One way we can do that is simply by creating multiple expressions like before, using col() to select each column separately:

df.select(
    col('accuracy').median(),
    col('rt').median()   
)
shape: (1, 2)
accuracyrt
f64f64
64.0502.974663

But Polars makes this much easier for us - we can condense this down to a single expression by giving our selector - col() - additional column names:

df.select(
    col('accuracy', 'rt').median()   # <- one expression repeated for both columns
)
shape: (1, 2)
accuracyrt
f64f64
64.0502.974663

These both do the same thing, so if you find it helpful to be explicit at first, building up each expression one at a time, feel free to do that!

Later on you might find it helpful to use a single condensed expression, when you find yourself getting annoyed by repeating yourself.

Renaming expression outputs#

Let’s try creating two expressions that operate on the same column. In natural language:

Select only the accuracy column.
For accuracy, calculate its mean
For accuracy, calculate its standard deviation

Let’s try it:

df.select(
    col('accuracy').mean(),
    col('accuracy').std()
)

Oops, Polars gives us an error:
DuplicateError: the name 'accuracy' is duplicate

That’s because Polars tries to help us out by making sure all our column names are unique.

By default the results of an expression are saved using the same column name that you selected.

In this case we selected “accuracy” using col('accuracy') twice - once to calculate the mean and once to calculate the standard deviation. So Polars is trying to save both results into a column called accuracy causing a conflict!

To fix this, we can extend our expression with additional operations using method-chaining with the . syntax.

A common operation is .alias() which you’ll often put at the end of an expression in order to give it a new name:

df.select(
    col('accuracy').mean().alias('acc_mean'),
    col('accuracy').std().alias('acc_std')
)
shape: (1, 2)
acc_meanacc_std
f64f64
56.06666727.043528

Two “styles” of expressing yourself#

You might find this “method-chaining” style with .alias unintuitive at first.

So Polars also lets you rename your expressions in a different “style” using =, as in other languages like R.

In English, we could rephrase our expressions as so:

“Select the accuracy column.
Create a new column named ‘acc_mean’, which is the mean of accuracy.
Create a new column named ‘acc_std’, which is the standard-deviation of accuracy.”

And in code like this:

df.select(
    acc_mean = col('accuracy').mean(),
    acc_std = col('accuracy').std()
)
shape: (1, 2)
acc_meanacc_std
f64f64
56.06666727.043528

You can use whichever style of “phrasing” an expression feels more natural to you based on what you’re doing!

Your turn#

Run the following code. Why are the values in the accuracy column being overwritten? Can you fix it?

df.select(
    col('participant'),
    col('accuracy').mean(),
)
shape: (15, 2)
participantaccuracy
i64f64
156.066667
156.066667
156.066667
156.066667
156.066667
356.066667
356.066667
356.066667
356.066667
356.066667
# Solution
df.select(
    col('participant'),
    col('accuracy'),
    acc_mean = col('accuracy').mean(),
)
shape: (15, 3)
participantaccuracyacc_mean
i64i64f64
14456.066667
14756.066667
16456.066667
16756.066667
16756.066667
37056.066667
38856.066667
38856.066667
31256.066667
35856.066667

Aggregating columns: .group_by()#

The .group_by('some_col') context is used for summarizing columns separately by 'some_col'.

Figure 1

You always follow up a .group_by() with .agg(), and place your expressions inside to tell Polars what should be calculated per group.

Using .group_by() will always give you a smaller DataFrame than the original.
Specifically you will get back a DataFrame whose rows = number of groups.

df.group_by('participant').agg(        # <- context

    col('rt').mean(),                  # <- expression
    col('accuracy').mean(),           # <- another expression
)
shape: (3, 3)
participantrtaccuracy
i64f64f64
3485.29405763.2
2496.96938247.2
1573.52379757.8

Maintaining group order#

Unfortunately, by default Polars doesn’t preserve the order of groups as they appear in the original DataFrame. But we can easily fix this by giving .group_by() an additional argument, maintain_order=True:

df.group_by('participant', maintain_order=True).agg(  
    col('rt').mean(),                  
    col('accuracy').mean(),          
)
shape: (3, 3)
participantrtaccuracy
i64f64f64
1573.52379757.8
2496.96938247.2
3485.29405763.2

Your turn#

Calculate each participant’s average reaction time divided by their average accuracy.
Remember since there are just 3 unique participants, i.e. 3 “groups”, our result should have 3 rows; one for each participant.

Hint: you can divide 2 columns using the method-chaining style with .truediv() or simply using /

# Your code here

# Hint: use group_by on 'participant' and then create an expression
# that divides the average 'rt' by average 'accuracy' and name it 'rt_acc_avg'
# Solution
df.group_by('participant', maintain_order=True).agg(
    rt_acc = col('rt').mean() / col('accuracy').mean()
)
shape: (3, 2)
participantrt_acc
i64f64
19.922557
210.529012
37.678703

Creating columns: .with_columns()#

Whenever we want to return the original DataFrame along with some new columns we can use the .with_columns context instead of .select.

Figure 1

It will always output the original DataFrame and the outputs of your expressions.

If your expression returns just a single value (e.g. the mean of a column), Polars is smart enough to automatically repeat that value over all the rows to make sure it fits inside the DataFrame.

df.with_columns(                                    # <- with_columns context
    acc_mean = col('accuracy').mean()               # <- expression
)
shape: (15, 4)
participantaccuracyrtacc_mean
i64i64f64f64
144261.2843556.066667
147728.20848956.066667
164801.88901656.066667
167713.55502656.066667
167362.68210556.066667
370439.52497356.066667
38855.1192856.066667
388644.80127256.066667
312571.80055356.066667
358715.22420856.066667

Contrast this with the .select context which will only return the mean of accuracy:

df.select(                                    
    acc_mean = col('accuracy').mean()               
)
shape: (1, 1)
acc_mean
f64
56.066667

As before we can create multiple new columns by including multiple expressions

df.with_columns(
    acc_mean = col('accuracy').mean(),
    rt_scaled = col('rt') / 100 

)
shape: (15, 5)
participantaccuracyrtacc_meanrt_scaled
i64i64f64f64f64
144261.2843556.0666672.612843
147728.20848956.0666677.282085
164801.88901656.0666678.01889
167713.55502656.0666677.13555
167362.68210556.0666673.626821
370439.52497356.0666674.39525
38855.1192856.0666670.551193
388644.80127256.0666676.448013
312571.80055356.0666675.718006
358715.22420856.0666677.152242

Use .with_columns for “Tidy” group-by operations#

A very handy use for .with_columns is to combine it with the .over() operation.

This allows us to calculate an expression separately by group, but then save the results into a DataFrame the same size as the original.

That is, Polars keeps the tidy format of the data and correctly repeats each group’s value across that group’s rows.

df.with_columns(
    acc_mean = col('accuracy').mean().over('participant')
)
shape: (15, 4)
participantaccuracyrtacc_mean
i64i64f64f64
144261.2843557.8
147728.20848957.8
164801.88901657.8
167713.55502657.8
167362.68210557.8
370439.52497363.2
38855.1192863.2
388644.80127263.2
312571.80055363.2
358715.22420863.2

Chaining .over('some_col') to any expression is like using .group_by but preserving the shape of the original DataFrame:

df.with_columns(
    acc_mean = col('accuracy').mean().over('participant'),          
    rt_mean = col('rt').mean().over('participant'),                  
)
shape: (15, 5)
participantaccuracyrtacc_meanrt_mean
i64i64f64f64f64
144261.2843557.8573.523797
147728.20848957.8573.523797
164801.88901657.8573.523797
167713.55502657.8573.523797
167362.68210557.8573.523797
370439.52497363.2485.294057
38855.1192863.2485.294057
388644.80127263.2485.294057
312571.80055363.2485.294057
358715.22420863.2485.294057

Remember that the .group_by() context will always return a smaller aggregated DataFrame:

df.group_by('participant', maintain_order=True).agg(
    acc_mean = col('accuracy').mean(),
    rt_mean = col('rt').mean()
)
shape: (3, 3)
participantacc_meanrt_mean
i64f64f64
157.8573.523797
247.2496.969382
363.2485.294057

In Polars you should only rely on .group_by if you know for sure that you want your output to be smaller than your original DataFrame - and by smaller we mean rows = number of groups.

Your turn#

Create a dataframe that adds 3 new columns:

  1. Accuracy on a 0-1 scale

  2. RT / Accuracy

  3. RT / max RT, separately using each participant’s max RT

# Your code here


# Hint: you can wrap an entire expression in () and use .over() 
# on the entire wrapped expression to do things like 
# add or subtract multiple columns by "participant"
# Solution
df.with_columns(
    acc_scaled = col('accuracy') / 100,
    rt_acc = col('rt') / col('accuracy'),
    rt_max_scaled = ( col('rt') / col('rt').max() ).over('participant')
)
shape: (15, 6)
participantaccuracyrtacc_scaledrt_accrt_max_scaled
i64i64f64f64f64f64
144261.284350.445.9382810.325836
147728.2084890.4715.4937980.908116
164801.8890160.6412.5295161.0
167713.5550260.6710.6500750.889843
167362.6821050.675.4131660.452285
370439.5249730.76.2789280.614528
38855.119280.880.6263550.077066
388644.8012720.887.3272870.901537
312571.8005530.1247.6500460.79947
358715.2242080.5812.3314521.0

Selecting rows: .filter()#

The .filter context is used for subsetting rows using a logical expression.

Figure 1

Instead of returning one or more values like other expressions, a logical expression returns True/False values that we can use to keep only the rows that meet those criteria:

df.filter(
    col('participant') == 1
)
shape: (5, 3)
participantaccuracyrt
i64i64f64
144261.28435
147728.208489
164801.889016
167713.555026
167362.682105

Or in Polars methods-style using .eq():

df.filter(
    col('participant').eq(1)
)
shape: (5, 3)
participantaccuracyrt
i64i64f64
144261.28435
147728.208489
164801.889016
167713.555026
167362.682105

Or even the opposite: we can negate or invert any logical expression by putting a ~ in front of it.

This is like using not in regular Python or ! in some other languages.

df.filter(
    ~col('participant').eq(1)
)
shape: (10, 3)
participantaccuracyrt
i64i64f64
29502.974663
283424.866821
221492.355273
236573.594895
287491.05526
370439.524973
38855.11928
388644.801272
312571.800553
358715.224208

But be careful. If you’re not using the method-chaining style then you need to wrap your expression in () before using ~:

df.filter(
    ~( col('participant') == 1 )
)
shape: (10, 3)
participantaccuracyrt
i64i64f64
29502.974663
283424.866821
221492.355273
236573.594895
287491.05526
370439.524973
38855.11928
388644.801272
312571.800553
358715.224208

Just like with other contexts (i.e. .select, .with_columns, .group_by) we can use multiple logical expressions to refine our filtering criteria.

If we separate expressions with a , Polars treats them logically as an and statement. For example, here we use 2 logical expressions to filter rows where participant is 1 AND accuracy is 67:

df.filter(
    col('participant').eq(1),
    col('accuracy').eq(67)
)
shape: (2, 3)
participantaccuracyrt
i64i64f64
167713.555026
167362.682105

But you might find it clearer to use & for and expressions:

df.filter(
    col('participant').eq(1) & col('accuracy').eq(67)
)
shape: (2, 3)
participantaccuracyrt
i64i64f64
167713.555026
167362.682105

The | operator can be used for or expressions

df.filter(
    col('participant').eq(1) | col('participant').eq(3)
)
shape: (10, 3)
participantaccuracyrt
i64i64f64
144261.28435
147728.208489
164801.889016
167713.555026
167362.682105
370439.524973
38855.11928
388644.801272
312571.800553
358715.224208

To combine more complicated logical expressions, you can wrap them in ().

Below we get rows where participant 1’s accuracy is 67 OR any of participant 2’s rows:

df.filter(
    ( col('participant').eq(1) & col('accuracy').eq(67) ) | col('participant').eq(2)
)
shape: (7, 3)
participantaccuracyrt
i64i64f64
167713.555026
167362.682105
29502.974663
283424.866821
221492.355273
236573.594895
287491.05526

Two “styles” of logical expression#

Like renaming the outputs of an expression, Polars gives us 2 styles we can use to combine logical expressions.
We’ve seen the first one using & and |.

The second one uses the method-chaining style with the . syntax. Here Polars provides a .and_() and a .or_() method.

Feel free to use whichever style you find more intuitive and readable:

df.filter(
    col('participant').eq(1).and_(   # <- 1st expression, followed by .and_()
        col('accuracy').eq(67)       # <- 2nd expression, inside the .and_()
    ).or_(
        col('participant').eq(2)     # <- 3rd expression, inside the .or_()
        )
)
shape: (7, 3)
participantaccuracyrt
i64i64f64
167713.555026
167362.682105
29502.974663
283424.866821
221492.355273
236573.594895
287491.05526

Expressions are for performing operations#

So far we’ve seen how to build up an expression that computes some value, e.g. .mean(), or performs some logic, e.g. .eq().

Polars calls these computations operations and includes tons of them (accessible via . syntax). Some of the notable ones include (see the short sketch after this list):

Arithmetic, e.g. .add, .sub, .mul

Boolean, e.g. .all, .any, .is_null, .is_not_null

Summary (aggregate) stats, e.g. .mean, .median, .std, .count

Comparison, e.g. .gt, .lt, .ge, .le, .eq, .ne.
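
Here’s a minimal sketch combining a few of these operations on the df from above (the output column names are made up for illustration):

df.select(
    col('rt').sub(col('rt').mean()).alias('rt_centered'),   # arithmetic: subtract the mean
    col('accuracy').gt(50).alias('acc_above_50'),            # comparison: greater than 50?
    col('rt').is_null().any().alias('any_missing_rt'),       # boolean: any missing values?
)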

Your Turn#

Use the linked documentation and contexts you learned about above to complete the following exercises:

  1. Select the accuracy and RT columns from df and multiply them by 10

# Your code here
# Solution
df.select(
    col('accuracy', 'rt') * 10
)
shape: (15, 2)
accuracyrt
i64f64
4402612.843496
4707282.084892
6408018.890165
6707135.550259
6703626.821046
7004395.249735
880551.192796
8806448.012718
1205718.005529
5807152.242075
  1. Add 2 new columns to the dataframe: rt_acc and acc_max_scaled

For rt_acc divide reaction time by accuracy.

For acc_max_scaled divide accuracy by maximum accuracy, separately by participant.

# Your code here
# Solution
df.with_columns(
    rt_acc = col('rt') / col('accuracy'),
    acc_max_scaled = ( col('accuracy') / col('accuracy').max() ).over('participant')
)
shape: (15, 5)
participantaccuracyrtrt_accacc_max_scaled
i64i64f64f64f64
144261.284355.9382810.656716
147728.20848915.4937980.701493
164801.88901612.5295160.955224
167713.55502610.6500751.0
167362.6821055.4131661.0
370439.5249736.2789280.795455
38855.119280.6263551.0
388644.8012727.3272871.0
312571.80055347.6500460.136364
358715.22420812.3314520.659091
  1. Filter rows where reaction time is > 100ms and < 725ms

# Your code here

# Hint: You should write a logical expression 
df.filter(
    col('rt').ge(100) & col('rt').lt(725)
)
shape: (12, 3)
participantaccuracyrt
i64i64f64
144261.28435
167713.555026
167362.682105
29502.974663
283424.866821
287491.05526
370439.524973
388644.801272
312571.800553
358715.224208
df.filter(
    col('rt').is_between(100, 725)
)
shape: (12, 3)
participantaccuracyrt
i64i64f64
144261.28435
167713.555026
167362.682105
29502.974663
283424.866821
287491.05526
370439.524973
388644.801272
312571.800553
358715.224208

Saving re-usable expressions#

When you find yourself creating complex expressions that you want to re-use later on, perhaps across other DataFrames, you can save them as re-usable functions!

For example, Polars doesn’t include an operation to calculate a z-score by default. But we know how to do that manually. So let’s create a function called scale that defines an expression we can reuse.

def scale(column_name):
    """Reminder:
        z-score = (x - x.mean()) / x.std()
    """
    return (col(column_name) - col(column_name).mean()) / col(column_name).std()

This is a function that accepts a single argument column_name, and then uses the col selector from Polars to select a column and calculate its z-score.

We can now use this expression in any context, which saves us a ton of typing and typos!

For example just across all accuracy scores:

df.select(
    acc_z = scale('accuracy')
)
shape: (15, 1)
acc_z
f64
-0.446194
-0.335262
0.293354
0.404287
0.404287
0.515219
1.180812
1.180812
-1.629472
0.07149

Or as a more realistic example: z-score separately by participant

This is a great example of where .with_columns + .over() can come in super-handy.

Because our function returns an expression we can call operations on it just like any other expression:

# .over() works with our scale() function
# because it returns a Polars expression!

df.with_columns(
    acc_z = scale('accuracy').over('participant'),
    rt_z = scale('rt').over('participant')
)
shape: (15, 5)
participantaccuracyrtacc_zrt_z
i64i64f64f64f64
144261.28435-1.216438-1.281041
147728.208489-0.9519950.634633
164801.8890160.5465150.936926
167713.5550260.8109580.574513
167362.6821050.810958-0.865031
370439.5249730.217085-0.175214
38855.119280.791722-1.646805
388644.8012720.7917220.610629
312571.800553-1.6345240.331166
358715.224208-0.1660060.880224

This is entirely equivalent to the following code, but so much easier to read and so much less potential for errors when typing:

df.with_columns(
    acc_z = (( col('accuracy') - col('accuracy').mean() )/col('accuracy').std()).over('participant'),
    rt_z = (( col('rt') - col('rt').mean() )/col('rt').std()).over('participant'),
)
shape: (15, 5)
participantaccuracyrtacc_zrt_z
i64i64f64f64f64
144261.28435-1.216438-1.281041
147728.208489-0.9519950.634633
164801.8890160.5465150.936926
167713.5550260.8109580.574513
167362.6821050.810958-0.865031
370439.5249730.217085-0.175214
38855.119280.791722-1.646805
388644.8012720.7917220.610629
312571.800553-1.6345240.331166
358715.224208-0.1660060.880224

As you’re thinking about how to manipulate data, think about saving an expression you find yourself using a lot as a function!
In fact Python has a shorthand way of writing 1-line functions using the lambda keyword.


scale = lambda column_name: (col(column_name) - col(column_name).mean()) / col(column_name).std()

This is the same as the function we defined above with def, reproduced here:

def scale(column_name):
    """Reminder:
        z-score = (x - x.mean()) / x.std()
    """
    return (col(column_name) - col(column_name).mean()) / col(column_name).std()

The structure of a one-line lambda function is:

name_you_want = lambda input_arg: body

And it’s possible to pass multiple inputs by separating them with a ,:

name_you_want = lambda input_arg1, input_arg_2: body
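
For example, a hypothetical two-argument expression (the names below are made up, not from the notebook above):

# A reusable expression that divides one column by another
ratio = lambda numerator, denominator: col(numerator) / col(denominator)

df.select(
    rt_per_acc = ratio('rt', 'accuracy')
)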

Your turn#

Create a Polars expression that mean-centers a column. You can use def or lambda, whichever feels more comfortable right now.

# Your code here
center = lambda name: col(name) - col(name).mean()

Add 2 new columns to the df DataFrame that include mean-centered accuracy, and mean-centered RT

# Your code here
df.with_columns(
    acc_centered = center('accuracy'),
    rt_centered = center('rt'),
)
shape: (15, 5)
participantaccuracyrtacc_centeredrt_centered
i64i64f64f64f64
144261.28435-12.066667-257.311396
147728.208489-9.066667209.612744
164801.8890167.933333283.293271
167713.55502610.933333194.95928
167362.68210510.933333-155.913641
370439.52497313.933333-79.070772
38855.1192831.933333-463.476466
388644.80127231.933333126.205526
312571.800553-44.06666753.204807
358715.2242081.933333196.628462
center = lambda *args: col(args) - col(args).mean()

df.with_columns(
    center('accuracy', 'rt').name.suffix('_centered')
)
shape: (15, 5)
participantaccuracyrtaccuracy_centeredrt_centered
i64i64f64f64f64
144261.28435-12.066667-257.311396
147728.208489-9.066667209.612744
164801.8890167.933333283.293271
167713.55502610.933333194.95928
167362.68210510.933333-155.913641
370439.52497313.933333-79.070772
38855.1192831.933333-463.476466
388644.80127231.933333126.205526
312571.800553-44.06666753.204807
358715.2242081.933333196.628462

More complex expressions with functions: when and lit#

Polars also offers a few other operations as functions you can use inside of a context for building expressions.

These are called as pl.something() but we can also directly import them.

You should check out the documentation to see what’s possible, but two common ones you’re likely to use are pl.when and pl.lit

# Directly import them to make life easier
from polars import when, lit

when lets you run an if-else statement as an expression, which is particularly useful for creating new columns based on the values in another column.

lit works in conjunction with when to tell Polars to use the literal value of something rather than try to find a corresponding column name:

Let’s use them together to create a new column that splits participant responses that were faster and slower than 300ms:

We’ll use the .with_columns context, because we want the result of our expression (the new column) and the original DataFrame:

ddf = df.with_columns(  # <- another new context like .select

    # Create a new column rt_split that contains the result
    # of the following if/else statement:

    # If RT >= 300
    rt_split = when(col('rt') >= 300) # could have used .ge() instead

    # Set the value to the lit(eral) string 'slow'
    .then(lit('slow'))

    # Otherwise, set the value to the lit(eral) string 'fast'
    .otherwise(lit('fast'))
)

# We saved the output to a new variable called ddf
ddf
shape: (15, 4)
participantaccuracyrtrt_split
i64i64f64str
144261.28435"fast"
147728.208489"slow"
164801.889016"slow"
167713.555026"slow"
167362.682105"slow"
370439.524973"slow"
38855.11928"fast"
388644.801272"slow"
312571.800553"slow"
358715.224208"slow"

Your turn#

  1. Use when and lit to add a column to the dataframe called performance.

It should contain the string ‘success’ if accuracy >= 50, or ‘fail’ if it was < 50.

Save the result to a new dataframe called df_new and print the first 10 rows:

# Your code here
df_new = df.with_columns(
    performance = when(col('accuracy') >= 50)
    .then(lit('success'))
    .otherwise(lit('fail'))
)
df_new.head(10)
shape: (10, 4)
participantaccuracyrtperformance
i64i64f64str
144261.28435"fail"
147728.208489"fail"
164801.889016"success"
167713.555026"success"
167362.682105"success"
29502.974663"fail"
283424.866821"success"
221492.355273"fail"
236573.594895"fail"
287491.05526"success"
  1. Using the previous dataframe (df_new), summarize how many successes and failures each participant had.

Your result should have 6 rows: 2 for each participant

# Your code here


# Hint: you can group_by multiple columns by passing a list of column names, e.g. 

# df_new.group_by(['col_1', 'col_2']).agg(...)
# Solution
df_new.group_by(["participant", "performance"], maintain_order=True).agg(
    number_of_trials = col('performance').len()
)
shape: (6, 3)
participantperformancenumber_of_trials
i64stru32
1"fail"2
1"success"3
2"fail"3
2"success"2
3"success"4
3"fail"1

More complex expressions with attribute (type) operations#

In addition to importing functions to build more complicated expressions, Polars also allows you to perform specific operations based upon the type of data in a column.

You don’t need to import anything to use these. Instead you can use . syntax to “narrow down” to the type of data attribute you want, and then select the operations you would like.

For example, we’ll use the DataFrame we created in the previous section, ddf:

ddf.head()
shape: (5, 4)
participantaccuracyrtrt_split
i64i64f64str
144261.28435"fast"
147728.208489"slow"
164801.889016"slow"
167713.555026"slow"
167362.682105"slow"

Let’s create an expression that converts each value in the new “rt_split” column to uppercase.

We can do this by selecting with col() as usual, but before calling an operation like before, we first access the .str attribute, and then call operations that specifically work on strings!

ddf.with_columns(
    col('rt_split').str.to_uppercase() # .to_uppercase() is only available to str data!
)
shape: (15, 4)
participantaccuracyrtrt_split
i64i64f64str
144261.28435"FAST"
147728.208489"SLOW"
164801.889016"SLOW"
167713.555026"SLOW"
167362.682105"SLOW"
370439.524973"SLOW"
38855.11928"FAST"
388644.801272"SLOW"
312571.800553"SLOW"
358715.224208"SLOW"

Without .str to “narrow in” on the string attribute, Polars will complain with an AttributeError, because only str types have a .to_uppercase() operation!

ddf.with_columns(
    col('rt_split').to_uppercase() # no .str
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[72], line 2
      1 ddf.with_columns(
----> 2     col('rt_split').to_uppercase() # no .str
      3 )

AttributeError: 'Expr' object has no attribute 'to_uppercase'

Polars includes many attribute operations. The most common ones you’ll use are for working with:

.str: if your data are strings

.name: which allows you to quickly change the names of columns from within a more complicated expression.

.list: if your columns contain Python lists (see the short sketch after this list)
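
Here’s a minimal sketch of the .list attribute on a small made-up DataFrame with a list-typed column (assuming a recent Polars version where .list.len() is available):

lists_df = pl.DataFrame({
    "letters": ["a", "b"],
    "numbers": [[1, 2, 3], [4, 5]],
})

lists_df.with_columns(
    n_items = col('numbers').list.len(),   # number of items in each row's list
    total = col('numbers').list.sum(),     # sum of each row's list
)

Next, let’s see where the .name attribute comes in handy.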

For example, below we use a single expression inside the with_columns context to calculate the mean of the accuracy and rt columns.

df.with_columns(
    col('accuracy', 'rt').mean()
)
shape: (15, 3)
participantaccuracyrt
i64f64f64
156.066667518.595745
156.066667518.595745
156.066667518.595745
156.066667518.595745
156.066667518.595745
356.066667518.595745
356.066667518.595745
356.066667518.595745
356.066667518.595745
356.066667518.595745

Because we’re using .with_columns, the output of our expression is overwriting the original values in the accuracy and rt columns.

We saw how to rename output when we had separate col('accuracy').mean() and col('rt').mean() expressions: using .alias() at the end or = at the beginning.

But how do we change the names of both columns at the same time?

Accessing the .name attribute gives us additional operations that help us out. In this case we use the .suffix() operation to add a suffix to the output name(s).

df.with_columns(
    col('accuracy', 'rt').mean().name.suffix('_mean')
)
shape: (15, 5)
participantaccuracyrtaccuracy_meanrt_mean
i64i64f64f64f64
144261.2843556.066667518.595745
147728.20848956.066667518.595745
164801.88901656.066667518.595745
167713.55502656.066667518.595745
167362.68210556.066667518.595745
370439.52497356.066667518.595745
38855.1192856.066667518.595745
388644.80127256.066667518.595745
312571.80055356.066667518.595745
358715.22420856.066667518.595745

Now we have the original accuracy and rt columns and the newly named ones we created!

Building expressions from additional selectors#

So far we’ve seen how to use col() to select 1 or more columns we want to create an expression about.

But sometimes you need to select things in more complicated ways. Fortunately, Polars has additional selectors that we can use to express ourselves. A common pattern is to import these together using as:

from polars import selectors as cs

Then we can refer to these using cs.some_selector(). Some of these include:

  • cs.all()

  • cs.exclude()

  • cs.starts_with()

  • cs.string()

Let’s see some of these in action using our participant dataset from above:

import polars.selectors as cs

df.select(
    cs.all().count()   # <- get all cols and calc count()
)
shape: (1, 3)
participantaccuracyrt
u32u32u32
151515

This is the same as the following code, but with many fewer lines!

df.select(
    col('participant').count(),
    col('accuracy').count(),
    col('rt').count()
)
shape: (1, 3)
participantaccuracyrt
u32u32u32
151515

And cs.exclude() is like cs.all(), except it leaves out the column(s) you name:

df.select(
    cs.exclude('participant').mean() # <- all cols except participant
)
shape: (1, 2)
accuracyrt
f64f64
56.066667518.595745

We can select all columns that start with certain characters:

df.select(
    cs.starts_with('pa').n_unique()
)
shape: (1, 1)
participant
u32
3

Or even select columns based on the type of data they contain. In this case all the columns with Integer data:

df.select(
    cs.integer()
)
shape: (15, 2)
participantaccuracy
i64i64
144
147
164
167
167
370
388
388
312
358

There are many more useful selectors, so check out the selector documentation page when you’re trying the challenge exercises later on in this notebook.

Reshaping DataFrames#

Sometimes you’ll find yourself working with “non-tidy” DataFrames or “wide” format data.

What’s tidy-data again?

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each value must have its own cell.

In polars we can achieve this using:

.pivot(): long -> wide, similar to pivot_wider() in R
.unpivot(): wide -> long, similar to pivot_longer() in R
pl.concat(): combine 2 or more dataframes/columns/rows into a larger DataFrame (see the short sketch after this list)
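
Before we reshape anything, here’s a minimal sketch of pl.concat(), since the examples below focus on .unpivot() and .pivot(). It stacks DataFrames with matching columns on top of one another (vertical concatenation is the default):

# Two small DataFrames with the same columns (made-up data for illustration)
df_a = pl.DataFrame({'participant': [1], 'rating': [10]})
df_b = pl.DataFrame({'participant': [2], 'rating': [25]})

pl.concat([df_a, df_b])   # -> a single DataFrame with 2 rows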

Here, I’ve generated data from two participants with three observations. This data frame is not tidy since each row contains more than a single observation.

df = pl.DataFrame({
    'participant': [1, 2],
    'observation_1': [10, 25],
    'observation_2': [100, 63],
    'observation_3': [24, 45],
})
df
shape: (2, 4)
participantobservation_1observation_2observation_3
i64i64i64i64
11010024
2256345

We can make it tidy by using the .unpivot method on the DataFrame, which takes 4 arguments:

on: the column(s) that contain values for each row
index: the column(s) to use as the identifier across rows
variable_name: name of the column that contains the original column names
value_name: name of the column that contains the values that were previously spread across columns

df_long = df.unpivot(
    on=cs.starts_with('observation'),
    index='participant', 
    variable_name='trial',
    value_name='rating'
    )
df_long
shape: (6, 3)
participanttrialrating
i64stri64
1"observation_1"10
2"observation_1"25
1"observation_2"100
2"observation_2"63
1"observation_3"24
2"observation_3"45

The .pivot method is the counterpart of .unpivot. We can use it to turn tidy data (long) into wide format. It takes 4 arguments as well:

on: the column(s) whose values will be turned into new columns
index: the column(s) that are unique rows in the new DataFrame
values: the values that will be moved into new columns with each row
aggregate_function: how to aggregate multiple rows within each index, e.g. None, mean, first, sum, etc

df_long.pivot(
    on='trial',
    index='participant', 
    values='rating',
    aggregate_function=None
    )
shape: (2, 4)
participantobservation_1observation_2observation_3
i64i64i64i64
11010024
2256345

You can safely set aggregate_function = None if you don’t have repeated observations within each unique combination of index and on. In this case each participant only has a single “observation_1”, “observation_2”, and “observation_3”.

But if they had multiple, Polars will raise an error and ask you to specify how to aggregate them.
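
For example, here’s a small sketch with made-up duplicate observations, where we tell .pivot() to average them:

# Hypothetical data: participant 1 has two ratings for the same trial
dupes = pl.DataFrame({
    'participant': [1, 1, 2],
    'trial': ['observation_1', 'observation_1', 'observation_1'],
    'rating': [10, 20, 25],
})

dupes.pivot(
    on='trial',
    index='participant',
    values='rating',
    aggregate_function='mean',   # with aggregate_function=None this would raise an error
)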

Splitting 1 column into many#

Sometimes you’ll need to split one column into multiple columns.
Let’s say we wanted to split the “year_month” column into 2 separate columns of “year” and “month”:

df = pl.DataFrame({
    "id": [1, 2, 3],
    "year_month": ["2021-01", "2021-02", "2021-03"]
})
df
shape: (3, 2)
idyear_month
i64str
1"2021-01"
2"2021-02"
3"2021-03"

You can use attribute operations for .str to do this!

Specifically we can use .split_exact to split a str into n+1 parts.

df_split = df.with_columns(
    # string attribute method, to split by delimiter "-" into 2 parts
    col("year_month")
    .str.split_exact("-", 1)
)
df_split
shape: (3, 2)
idyear_month
i64struct[2]
1{"2021","01"}
2{"2021","02"}
3{"2021","03"}

Polars stores these parts in a struct, which behaves much like a Python dictionary:

# First row, second column value
df_split[0, 1]
{'field_0': '2021', 'field_1': '01'}

Polars provides additional attribute operations on the .struct to create new columns.

First we’ll call .rename_fields to rename the fields of the struct (equivalent to renaming the keys of a Python dictionary).

df_split = df.with_columns(
    col("year_month")
    # string attribute method, to split by delimiter "-" into 2 parts
    .str.split_exact("-", 1)

    # struct attribute method to rename fields
    .struct.rename_fields(["year", "month"])
)
df_split
shape: (3, 2)
idyear_month
i64struct[2]
1{"2021","01"}
2{"2021","02"}
3{"2021","03"}
# First row, second column value
df_split[0, 1]
{'year': '2021', 'month': '01'}

Then we’ll call struct.unnest() to create new columns, 1 per field

df_split = df.with_columns(
    col("year_month")
    # string attribute method, to split by delimiter "-" into 2 parts
    .str.split_exact("-", 1)

    # struct attribute method to rename fields
    .struct.rename_fields(["year", "month"])

    # struct attribute method to create 1 column per field
    .struct.unnest()
)
df_split
shape: (3, 4)
idyear_monthyearmonth
i64strstrstr
1"2021-01""2021""01"
2"2021-02""2021""02"
3"2021-03""2021""03"

We can also split up list values in a column over rows using the .explode('column_name') method on the DataFrame itself:

df = pl.DataFrame(
    {
        "letters": ["a", "a", "b", "c"],
        "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
    }
)
df
shape: (4, 2)
lettersnumbers
strlist[i64]
"a"[1]
"a"[2, 3]
"b"[4, 5]
"c"[6, 7, 8]
df.explode('numbers')
shape: (8, 2)
lettersnumbers
stri64
"a"1
"a"2
"a"3
"b"4
"b"5
"c"6
"c"7
"c"8

Combining many columns into 1#

We can combine columns into a single column using additional functions in an expression like pl.concat_list() and pl.concat_str(), which take column names as input:

df_split.with_columns(
    month_year = pl.concat_str('month', 'year', separator='-')
)
shape: (3, 5)
idyear_monthyearmonthmonth_year
i64strstrstrstr
1"2021-01""2021""01""01-2021"
2"2021-02""2021""02""02-2021"
3"2021-03""2021""03""03-2021"

Polars also includes various functions that end with _horizontal.

Like the suffix implies, these functions are designed to operate horizontally across columns within each row separately.

Let’s say our DataFrame had these 3 numeric columns:

import numpy as np # to generate data

df = df.with_columns(
    a=np.random.normal(size=df.height),
    b=np.random.normal(size=df.height),
    c=np.random.normal(size=df.height),
)
df
shape: (4, 5)
lettersnumbersabc
strlist[i64]f64f64f64
"a"[1]0.2590861.0451840.351679
"a"[2, 3]-0.785854-1.131320.072687
"b"[4, 5]-0.906119-0.7680750.068268
"c"[6, 7, 8]0.3259251.156249-1.763849

And we want to create a new column that is the average of these 3 columns within each row. We can easily do that using a horizontal function like pl.mean_horizontal

df.with_columns(
    abc_mean = pl.mean_horizontal('a', 'b', 'c')
)
shape: (4, 6)
lettersnumbersabcabc_mean
strlist[i64]f64f64f64f64
"a"[1]0.2590861.0451840.3516790.551983
"a"[2, 3]-0.785854-1.131320.072687-0.614829
"b"[4, 5]-0.906119-0.7680750.068268-0.535309
"c"[6, 7, 8]0.3259251.156249-1.763849-0.093892

Your turn#

Make the following DataFrame “tidy”, i.e. long-format with 4 columns:

  • participant: integer of participant ID

  • order: integer stimulus and observation order (from column names)

  • stimulus: string of stimulus name

  • observation: float of numeric rating each participant gave

reshape = pl.DataFrame({
    'participant': [1., 2.],
    'stimulus_1': ['flower', 'car'],
    'observation_1': [10., 25.,],
    'stimulus_2': ['house', 'flower'],
    'observation_2': [100., 63.,],
    'stimulus_3': ['car', 'house'],
    'observation_3': [24., 45.,]
})
reshape
shape: (2, 7)
participantstimulus_1observation_1stimulus_2observation_2stimulus_3observation_3
f64strf64strf64strf64
1.0"flower"10.0"house"100.0"car"24.0
2.0"car"25.0"flower"63.0"house"45.0

Hints

Think about this as a sequence of 4 steps. We’ve created 4 code cells below with an image above each of the expected result:

  1. unpivot wide -> long

Figure 1
# Your code here
# Solution
reshape.unpivot(
    index='participant',
    on=cs.contains('_'),
    variable_name='trial',
    value_name='rating'
)
shape: (12, 3)
participanttrialrating
f64strstr
1.0"stimulus_1""flower"
2.0"stimulus_1""car"
1.0"observation_1""10.0"
2.0"observation_1""25.0"
1.0"stimulus_2""house"
2.0"observation_2""63.0"
1.0"stimulus_3""car"
2.0"stimulus_3""house"
1.0"observation_3""24.0"
2.0"observation_3""45.0"
  1. split the variable_name column from the previous step (I called it “trial”) into 2 new columns by ‘_’ (which I called “index” and “order”)

Figure 1
# Your code here
# Solution
reshape.unpivot(
    index='participant',
    on=cs.contains('_'),
    variable_name='trial',
    value_name='rating'
).with_columns(
    col('trial').str.split_exact('_', 1)
    .struct.rename_fields(['index', 'order'])
    .struct.unnest()
)
shape: (12, 5)
participanttrialratingindexorder
f64strstrstrstr
1.0"stimulus_1""flower""stimulus""1"
2.0"stimulus_1""car""stimulus""1"
1.0"observation_1""10.0""observation""1"
2.0"observation_1""25.0""observation""1"
1.0"stimulus_2""house""stimulus""2"
2.0"observation_2""63.0""observation""2"
1.0"stimulus_3""car""stimulus""3"
2.0"stimulus_3""house""stimulus""3"
1.0"observation_3""24.0""observation""3"
2.0"observation_3""45.0""observation""3"
  1. select only the columns: participant, 2 you created (I called mine “index” and “order”), and the value_name column from the first step (I called it “rating”)

Figure 1
# Your code here
reshape.unpivot(
    index='participant',
    on=cs.contains('_'),
    variable_name='trial',
    value_name='rating'
).with_columns(
    col('trial').str.split_exact('_', 1)
    .struct.rename_fields(['index', 'order'])
    .struct.unnest()
).select(
    col('participant').cast(int),
    col('index'),
    col('order').cast(int),
    col('rating')
)
shape: (12, 4)
participantindexorderrating
i64stri64str
1"stimulus"1"flower"
2"stimulus"1"car"
1"observation"1"10.0"
2"observation"1"25.0"
1"stimulus"2"house"
2"observation"2"63.0"
1"stimulus"3"car"
2"stimulus"3"house"
1"observation"3"24.0"
2"observation"3"45.0"
  1. pivot long -> wide to breakout the value_name column (I called it “rating”) into multiple columns

Figure 1
# Your code here
# Solution
reshape.unpivot(
    index='participant',
    on=cs.contains('_'),
    variable_name='trial',
    value_name='rating'
).with_columns(
    col('trial').str.split_exact('_', 1)
    .struct.rename_fields(['index', 'order'])
    .struct.unnest()
).select(
    col('participant').cast(int),
    col('index'),
    col('order').cast(int),
    col('rating')
).pivot(
    on='index',
    values='rating',
    index=['participant', 'order'], 
    aggregate_function=None
).with_columns(
    col('observation').cast(float)
)
shape: (6, 4)
participantorderstimulusobservation
i64i64strf64
11"flower"10.0
21"car"25.0
12"house"100.0
22"flower"63.0
13"car"24.0
23"house"45.0

Appendix#

Additional Resources#

Here are a few additional resources you might find helpful as you try out the challenge notebook later:

Interactive DataFrames#

Since Polars DataFrames are not interactive inside this notebook, we can use a function called show() from a neat library called itables, which will give us an interactive viewer for the DataFrame.

This won’t allow us to set or change values, but can be helpful to interactively scroll/search through a DataFrame when you’re first checking it out.

from itables import show
show(df)

WARNING

Make sure to delete any cell you use show() inside of before commit-ing your notebook.

The interactive viewer can make the file-size of your notebook much larger.

And since you probably only need it while working interactively, you can avoid a headache and just delete the cell when you’re done.

Use direct imports with caution#

In most of the help you’ll see online and in the official polars documentation, commands use pl.col, but we’ve been using col. What gives?

Python gives us a few tricks when we import a library to save ourselves some keystrokes when typing the same commands over and over again. One of those tricks is the as statement. This allows us to name something we’re importing by another name of our choosing.

What you’ve encountered before: import polars as pl and import numpy as np makes it so we can use pl to refer to polars and np to refer to numpy.

That makes it much easier to not have to type out something like numpy.random.normal and just use np.random.normal.

You can use as to name things whatever you want. But take caution: if you import something under a name that already exists, you will overwrite it!

my_name = 'eshin'
import numpy as my_name

my_name # is now the same as numpy!!!

Using import polars as pl, makes everything polars has available to us by prefixing with pl., for example pl.col().

But sometimes we use pl.col() so often, it would be nice to just type it as col(). We can do this using the from statement to directly import any functionality from a library:

import polars as pl
from polars import col # add this line

Now col is available to us and is exactly the same as pl.col.

As we learn about other polars functionality, you may find it helpful to directly import other functionality you use often. You can also combine this with as to make things even a bit cleaner:

import polars as pl
from polars import col as c

df.select(
  c('judgement').mean()
)

But again take caution. If you create a new variable c anywhere in your notebook, e.g. c = 3, it will overwrite what you imported from polars!

Coding style is a personal preference, but there are also conventions for a reason - communication:
so other people (including your future self) can read what you wrote.

Make things as concise and clear as you can, but don’t get carried away. Python’s big advantage compared to other languages is that it’s very easy to read and get a gist for what’s going on.

So as you use things like from and as, think about whether your code stays readable:

# Ummmm....
from numpy.random import normal as n
import polars as p
from polars import col as c

# Ehhh...
l, s, n = 0, 1, df.shape[0]

# How easily can someone else read this? Can you decipher it in 6mo?
df.select(p.all().add(n(l,s,n)))