Practical SQL for Data Analysis

What you can do without Pandas


Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.

SQL databases has been around since the 1970s. Some of the smartest people in the world worked on making it easy to slice, dice, fetch and manipulate data quickly and efficiently. SQL databases have come such a long way, that many developers and data scientists lost track of what they can do with the database they already have!

In this article I demonstrate how to use SQL to perform fast and efficient data analysis.

<small>Illustration by <a href="https://weareskribbl.com/mindfulness/">Victoria Holmes</a></small>
Illustration by Victoria Holmes

Table of Contents


SQL vs Pandas Performance

Imagine a simple table with 1M users, each with a username and an indication if the user was activated or not. A simple data analysis task would be to answer how many activated and inactivated users are there?

⚙ Benchmark setup

Create the table and populate with random data:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username TEXT NOT NULL,
    activated BOOLEAN NOT NULL
);

INSERT INTO users (
    username,
    activated
)
SELECT
    md5(random()::text)::text AS username,
    random() > 0.9 AS activated
FROM
    generate_series(1, 1000000);

Setup a python virtual environment and install dependencies:

$ python -m venv venv
$ source venv/bin/activate
$ pip install psycopg2 pandas memory-profiler

To produce benchmark results, create a script with the following pattern:

# bench.py
from memory_profiler import profile

@profile
def run():
    # TODO: Replace with code to benchmark
    print('do work!')

if __name__ == "__main__":
    run()

Execute your script from the command like and view the row level results:

$ python bench.py
do work!
Filename: bench.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3     38.9 MiB     38.9 MiB           1   @profile
     4                                         def run():
     5                                             # TODO: Replace with code to benchmark
     6     38.9 MiB      0.0 MiB           1       print('do work!')

You can find more details about this method in this article or in the memory-profiler documentation.


Let's start with a naive approach using pandas:

import psycopg2
import pandas as pd

connection = psycopg2.connect(dbname='db')
with connection.cursor() as cursor:
    cursor.execute('SELECT * FROM users')
    df = pd.DataFrame(
        cursor.fetchall(),
        columns=['id', 'username', 'activated'],
    )

result = df.groupby(by='activated').count()
print(result)

The script uses psycopg2 to create a connection to the database. It then fetches data from the users table into a pandas dataframe, and calls groupby to get the counts for activated and inactivated users.

This script produces the following output:

               id  username
activated
False      900029    900029
True        99971     99971

We got an answer to our question, but at what cost?

Pandas True Cost

Let's execute this function again, but this time look at the memory usage:

(venv) $ python test_pandas_naive.py
               id  username
activated
False      900029    900029
True        99971     99971
Filename: test_pandas_naive.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3     38.8 MiB     38.8 MiB           1   @profile
     4                                         def run():
     5     41.2 MiB      2.3 MiB           1       import psycopg2
     6     78.2 MiB     37.1 MiB           1       import pandas as pd
     7
     8     78.6 MiB      0.4 MiB           1       connection = psycopg2.connect(dbname='db')
     9     78.6 MiB      0.0 MiB           1       with connection.cursor() as cursor:
    10    179.9 MiB    101.3 MiB           1           cursor.execute('SELECT * FROM users')
    11    386.0 MiB     12.7 MiB           2           df = pd.DataFrame(
    12    373.4 MiB    193.5 MiB           1               cursor.fetchall(),
    13    373.4 MiB      0.0 MiB           1               columns=['id', 'username', 'activated'],
    14                                                 )
    15
    16    386.0 MiB      0.0 MiB           1       result = df.groupby(by='activated').count()
    17    386.0 MiB      0.0 MiB           1       print(result)

To view the memory usage of the program we use the package memory-profiler. We used this technique in the past to find the fastest way to load data into PostgreSQL using Python.

The output shows the overall memory usage for each row in the program, as well as the additional memory added by each row in the column "Increment". The output for this program reveals some interesting finds:

  1. Pandas alone consumes ~37M of memory: Just importing pandas, before even doing anything with it, consumes a significant amount of memory. For comparison, importing psycopg2 only adds 2.3MB of memory to the program.

  2. Fetching the data into memory consumed an additional ~300MB: When we fetched the data into memory, and then into a pandas dataframe, the program occupied an additional 300MB. For reference, the size of the table in the database is only 65MB.

If we ignore the 38MB consumed by the profiler itself, the program consumed 347MB of memory, and executing this script without the profiler took 1.101s to complete.

Removing Unnecessary Data

Our quick analysis showed that fetching the data consumed the most memory. To optimize that, we can try to fetch less data. For example, we don't really use the username column, so maybe we can not fetch it from the database:

(venv) $ python test_pandas.py
               id
activated
False      900029
True        99971
Filename: test_pandas.py

 #    Mem usage    Increment  Occurences   Line Contents
========================================================
 3     38.7 MiB     38.7 MiB           1   @profile
 4                                         def run():
 5     41.0 MiB      2.3 MiB           1       import psycopg2
 6     78.3 MiB     37.2 MiB           1       import pandas as pd
 7
 8     78.6 MiB      0.3 MiB           1       connection = psycopg2.connect(dbname='db')
 9     78.6 MiB      0.0 MiB           1       with connection.cursor() as cursor:
10    132.1 MiB     53.6 MiB           1           cursor.execute('SELECT id, activated FROM users')
11    142.1 MiB    -90.7 MiB           2           df = pd.DataFrame(
12    232.8 MiB    100.7 MiB           1               cursor.fetchall(),
13    232.8 MiB      0.0 MiB           1               columns=['id', 'activated'],
14                                                 )
15
16    142.1 MiB      0.0 MiB           1       result = df.groupby(by='activated').count()
17    142.1 MiB      0.0 MiB           1       print(result)

By explicitly providing a list of columns to the query and fetching only what we need, the program now consumes only 232MB, or 193MB without the overhead of the profiler. This is an improvement from the previous attempt which consumed 347MB of memory.

Executing the script without the profiler took 0.839s compared to the previous program which took 1.1s.

Aggregating in the Database

The most memory in the program is still the data being fetched into memory. What if instead of first fetching the data and aggregating using pandas, we would aggregate the data in the database, and create a pandas dataframe from the results:

$ python test_db.py
   activated     cnt
0      False  900029
1       True   99971
Filename: test_db.py

 #    Mem usage    Increment  Occurences   Line Contents
========================================================
 3     38.6 MiB     38.6 MiB           1   @profile
 4                                         def run():
 5     41.0 MiB      2.3 MiB           1       import psycopg2
 6     78.0 MiB     37.0 MiB           1       import pandas as pd
 7
 8     78.3 MiB      0.4 MiB           1       connection = psycopg2.connect(dbname='db')
 9     78.3 MiB      0.0 MiB           1       with connection.cursor() as cursor:
10     78.3 MiB      0.0 MiB           1           cursor.execute('''
11                                                     SELECT activated, count(*) AS cnt
12                                                     FROM users
13                                                     GROUP BY activated
14                                                 ''')
15     78.3 MiB      0.0 MiB           2           result = pd.DataFrame(
16     78.3 MiB      0.0 MiB           1               cursor.fetchall(),
17     78.3 MiB      0.0 MiB           1               columns=['activated', 'cnt'],
18                                                 )
19     79.3 MiB      1.0 MiB           1       print(result)

This is a big leap compared to the previous attempt. Doing the processing in the database and fetching aggregated results consumed only 79MB of memory, or 40MB if we remove the overhead of the profiler. This is a big improvement!

Executing the script without the profiler took 0.380s, which is twice as fast as the previous program which took 0.839s.

Removing Pandas

At this point the only significant memory hog is pandas itself. Just for fun and reference, let's see what the program consumes without pandas:

$ python test_db_plain.py
([(False, 900029), (True, 99971)],)
Filename: test_db_plain.py

 #    Mem usage    Increment  Occurences   Line Contents
========================================================
 3     38.9 MiB     38.9 MiB           1   @profile
 4                                         def run():
 5     41.2 MiB      2.4 MiB           1       import psycopg2
 6
 7     41.7 MiB      0.5 MiB           1       connection = psycopg2.connect(dbname='db')
 8     41.7 MiB      0.0 MiB           1       with connection.cursor() as cursor:
 9     41.7 MiB      0.0 MiB           1           cursor.execute('''
10                                                     SELECT activated, count(*) AS cnt
11                                                     FROM users
12                                                     GROUP BY activated
13                                                 ''')
14     41.7 MiB      0.0 MiB           1           result = cursor.fetchall(),
15
16     41.7 MiB      0.0 MiB           1       print(result)

After removing pandas and keeping the results as a python list of tuples, the program consumes 41MB of memory, or just 2.8MB if we ignore the profiler overhead. This is a huge difference from where we started!

The timing is also much lower. Without the profile this program completes in just 0.114s. That's 70% less than the previous attempt using pandas, and overall 90% faster than the first program.

Results Summary

This is the a summary of the results:

Program Peak Memory % Memory Diff Runtime % Runtime Diff
Pandas with entire table 347 MB 1.101s
Pandas with only necessary data 193 MB -44% 0.839s -23%
Pandas with aggregation in database 40 MB -80% 0.380s -54%
No Pandas, aggregation in database 2.3 MB -94% 0.114s -70%

This benchmark does not mention the memory consumed by the database itself - this is intentional. Databases usually consume a configurable amount of memory, and than manage allocations between different buffers and system components internally. Over the years, databases have gotten pretty good at managing their memory so you won't have to. Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!

Pandas and SQL: Better Together!

Programs that consume a lot of memory are a huge pain. Developers need powerful development environments, iterations are slower and the entire process takes more time. From an infrastructure perspective, resources cost money, and the more you scale the more you have to pay. The costs pile up pretty quickly.

All of this is not to say that Pandas is unnecessary, or that it can be replaced. Pandas provide great benefits and it has proven itself as being incredibly valuable. The same thing can be said for databases.

To take advantage of both worlds and create lightweight programs that are also fast, use SQL and Pandas together!

I'm focusing on Pandas and Numpy because they are the most popular, but the concepts described in the article apply to other tools and languages such as R, Julia, Matlab, SAS and so on. To make the argument even more compelling, I include interactive Hex Notebooks you can experiment with on your own.


Basics

The SQL query language was invented more than 40 years ago, and it is the most popular language for querying relational data. SQL is defined in an ANSI standard but there are still subtle differences between popular database engines such as PostgreSQL, MySQL, Oracle, SQL Server and others.

These are the common clauses of an SQL query:

SELECT <expressions>
FROM <tables>
JOIN <to other table> ON <join condition>
WHERE <predicates>
GROUP BY <expressions>
HAVING <predicate>
ORDER BY <expressions>
LIMIT <number of rows>

In PostgreSQL only the SELECT clause is really mandatory, so you can mix and match to do what you want.

Common Table Expressions

It's sometimes useful to split a large query into smaller steps. Using SQL, you can define a common table expression or "CTE" in short, with the WITH clause:

WITH emails AS (
    SELECT 'ME@hakibenita.com' AS email
)
SELECT * FROM emails;

       email
───────────────────
 ME@hakibenita.com

You can have multiple CTE's in a single query, and they can even depend on each other:

WITH emails AS (
    SELECT 'ME@hakibenita.com' AS email
),

normalized_emails AS (
    SELECT lower(email) AS email FROM emails
)

SELECT * FROM normalized_emails;

       email
───────────────────
 me@hakibenita.com

Common table expressions are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results!

Generating Data

Generating data is very handy. Sometimes you need to generate data for practice, sometime you need to generate a time series or a small table to join to. There are several ways to generate data in SQL:

UNION ALL

WITH dt AS (
    SELECT 1 AS id, 'haki' AS name
    UNION ALL
    SELECT 2, 'benita'
)
SELECT * FROM dt;

 id │  name
────┼────────
  1 │ haki
  2 │ benita

Using UNION ALL you can combine, or concatenate, the results of multiple queries.

Concatenating query results is very common, but it can be a bit tedious for generating data.

VALUES LIST

WITH dt AS (
    SELECT * FROM (
        VALUES
            (1, 'haki'),
            (2, 'benita')
    ) AS t(id, name)
)
SELECT * from dt;

Using the VALUES keyword you can provide a list of rows, and then define names and types using a "table alias list" t(..). The t can be any name. Using a VALUES list is very useful when you need to generate small sets of data, or as the documentation calls it, "constants table".

UNNEST

To generate small sets of one dimensional data, you can unnest a PostgreSQL array:

WITH dt AS (
    SELECT unnest(array[1, 2]) AS n
)
SELECT * FROM dt;

 n
───
 1
 2

This is more restricting than VALUES as it can only produce a one dimensional table of the same datatype, but we are going to use it later.

GENERATE_SERIES

To generate large amounts of data, PostgreSQL provides a table function called generate_series:

WITH dt AS (
    SELECT * FROM generate_series(0, 5) AS t(n)
)
SELECT * FROM dt;

 n
──
 0
 1
 2
 3
 4
 5

The function generate_series accepts three arguments: start, stop and step. In the example above we did not specify a step, so the default 1 was used. We can provide a different step to generate a different series:

WITH dt AS (
    SELECT * FROM generate_series(
        0,  -- start
        10, -- stop
        2   -- step
    ) AS t(n)
)
SELECT * FROM dt;

 n
────
  0
  2
  4
  6
  8
 10

To generate a list of even numbers, we set the step to 2.

The function generate_series is not restricted just to integers, it can be used for other types as well. One common examples is generating date ranges:

WITH daterange AS (
    SELECT *
    FROM generate_series(
        '2021-01-01 UTC'::timestamptz, -- start
        '2021-01-02 UTC'::timestamptz, -- stop
        interval '1 hour'   -- step
    ) AS t(hh)
) SELECT * FROM daterange;

           hh
────────────────────────
 2021-01-01 00:00:00+00
 2021-01-01 01:00:00+00
 2021-01-01 02:00:00+00
 ...
 2021-01-01 22:00:00+00
 2021-01-01 23:00:00+00
 2021-01-02 00:00:00+00

To generate a 24 hour range we provided generate_series with a start and end data, and set the step to a 1 hour interval.

GENERATE_SERIES with row numbers

As mentioned above, generate_series is a "table function". There is a little trick with table functions to include row numbers in the result:

WITH daterange AS (
    SELECT *
    FROM generate_series(
        '2021-01-01'::timestamptz, -- start
        '2021-01-02'::timestamptz, -- stop
        interval '1 hour'   -- step
    ) WITH ORDINALITY AS t(hh, n)
) SELECT * FROM daterange;

           hh           │ n
────────────────────────┼────
 2021-01-01 00:00:00+00 │  1
 2021-01-01 01:00:00+00 │  2
 2021-01-01 02:00:00+00 │  3
 ...
 2021-01-01 22:00:00+00 │ 23
 2021-01-01 23:00:00+00 │ 24
 2021-01-02 00:00:00+00 │ 25

Using WITH ORDINALITY, the results now include another column with the row number.

Random

To generate random numbers PostgreSQL provides a random function that returns a value between 0 and 1:

SELECT random();
0.5917508391168769

To generate values at different ranges you can random in an expression:

-- Random float between 0 and 100
SELECT random() * 100;
59.17508391168769

-- Random integer between 1 and 100
SELECT ceil(random() * 100);
59

-- Random integer between 11 and 100
SELECT 10 + ceil(random() * 90);
59

It's a common mistake to use round instead of ceil or floor to generate a range of integers. Using round may produce inconsistent distribution. Consider the following query to generate random integers in the range 0 - 4 using round instead ceil:

SELECT
    round(random() * 3) AS n,
    count(*)
FROM
    generate_series(0, 1000)
GROUP BY
    1;

n │ count
──┼───────
0 │   150
1 │   328
2 │   341
3 │   182

Notice how the values 0 and 3 are coming up less than 1 and 2. Using round, random values less than 0.5 will be rounded down to 0, and random numbers greater than 2.5 will be rounded up to 3, while for example, random values between 0.5 and 1.5 will be rounded to 1. This makes the edges less likely to come up.

01230123

Random distribution using round

This problem can be solved by either rounding up or down. Consider the same query using ceil:

SELECT
    ceil(random() * 3) AS n,
    count(*)
FROM
    generate_series(0, 1000)
GROUP BY
    1;

n │ count
──┼───────
1 │   328
2 │   339
3 │   334

Using ceil produces more evenly distributed random numbers.

0123123

Random distribution using ceil

Random Choice

You can use the random function to pick a random value from a list of values:

SELECT
    (array['red', 'green', 'blue'])[ceil(random() * 3)] AS color
FROM
    generate_series(1, 5);

 color
───────
 green
 green
 blue
 green
 blue

The expression defines an array of colors, and then uses random to get a random element from the array. Notice that in PostgreSQL, arrays start at 1:

-- In PostgreSQL arrays start at 1
SELECT (array['red', 'green', 'blue'])[1];

 array
───────
 red

Sampling

Sampling a random portion of a table is a very common when training a model. A simple way to fetch a random portion of a table is combining random with LIMIT:

db=# WITH sample AS (
    SELECT *
    FROM users
    ORDER BY random() LIMIT 10000
)
SELECT count(*) FROM sample;
 count
───────
 10000
(1 row)

Time: 205.643 ms

To sample 10K random rows from the table you first sort in a random order, and then take the first 10K rows.

Using random to sample data is great, but for very large datasets it can be inefficient. PostgreSQL provides other methods of sampling a proportion of a table, which are more suited for large tables.

PostgreSQL provides two sampling methods, SYSTEM and BERNOULLI. To sample a table, use the TABLESAMPLE keyword in the FROM clause, and provide the sampling method along with it's arguments. For example, sampling 10% of the table using the SYSTEM sampling method:

db=# WITH sample AS (
    SELECT *
    FROM users TABLESAMPLE SYSTEM(10)
)
SELECT count(*) FROM sample;

 count
───────
 95400
(1 row)

Time: 13.690 ms

The SYSTEM sampling method works by sampling blocks rather than rows, which makes it very fast. The table we sampled contains 1M rows, and the sample returned slightly less than 100K rows. For large datasets it's not uncommon to compromise accuracy for performance.

Another sampling method provided by PostgreSQL is BERNOULLI:

db=# WITH sample AS (
    SELECT *
    FROM users TABLESAMPLE BERNOULLI(10)
)
SELECT count(*) FROM sample;

 count
────────
 100364
(1 row)

Time: 54.593 ms

Unlike the SYSTEM sampling method, BERNOULLI works at the row level which makes it a bit slower, but the results are better distributed.

These are the timings for sampling 10% of table with 1M rows using different sampling methods:

Sampling Method Timing
random() 205ms
BERNOULLI 54ms
SYSTEM 13ms

If you need to sample from a large table consider using TABLESAMPLE.

Example: Train / Test Split with SQL

A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.

To put what you've seen so far to practice, generate a transactions table with some random data:

CREATE TABLE transaction AS
    SELECT
        id,
        '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at,
        round(10 + 90 * random()) as charged_amount,
        random() > 0.6 as reported_as_fraud
    FROM
        generate_series(1, 10) AS id
    ORDER BY
        1;

The transaction table include the date and amount of the transaction, and an indication whether the transaction was reported as fraudulent.

Before we move on, let's break it down:

SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at;

Add a random number of days between 0 and 365 to January 1st, 2021 to produce a random date in that year.

SELECT round(10 + 90 * random()) as charged_amount;

Produce a random round charged amount between 10 and 100.

SELECT random() > 0.6 as reported_as_fraud;

Produce the parameter we want to estimate. In our fake data, we want to have 40% fraudulent transactions. Using an expression we produce a boolean value which will evaluate to true ~40% of the times.

This is what the data looks like:

db=# SELECT * FROM transaction;

 id │      billed_at      │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
  1 │ 2021-05-22 00:00:00 │             54 │ t
  2 │ 2021-05-31 00:00:00 │             63 │ f
  3 │ 2021-11-11 00:00:00 │             26 │ t
  4 │ 2021-07-04 00:00:00 │             64 │ t
  5 │ 2021-02-27 00:00:00 │             90 │ t
  6 │ 2021-05-21 00:00:00 │             20 │ t
  7 │ 2021-07-29 00:00:00 │             69 │ t
  8 │ 2021-02-24 00:00:00 │             20 │ f
  9 │ 2021-05-07 00:00:00 │             36 │ f
 10 │ 2021-05-05 00:00:00 │             38 │ f

To test a model which classifies transactions as fraudulent, we want to split the table into a training and test datasets. One way to do that is adding a column, but we are going to create two separate tables instead.

To create a table similar to an existing table in PostgreSQL, you can use the following commands:

CREATE TABLE transaction_training AS TABLE transaction WITH NO DATA;
CREATE TABLE transaction_test AS TABLE transaction WITH NO DATA;

This is a really handy syntax! We simply tell PostgreSQL to create a table similar to another table, but with no data.

Next, we want to split the data in the transaction table between transaction_training and transaction_test. We want our training set to include 80% of the rows, in this case 8 rows:

WITH
training_transaction_ids AS (
    INSERT INTO transaction_training
    SELECT * FROM transaction
    ORDER BY random() LIMIT 8
    RETURNING id
)

INSERT INTO transaction_test
SELECT * FROM transaction
WHERE id NOT IN (SELECT id FROM training_transaction_ids);

To populate data for training we select from the transaction table, shuffle the rows using ORDER BY random() and then insert into transaction_training just the first 8 rows.

To insert only the remaining rows into the test table, we keep the ids of the training rows by specifying RETURNING id in a common table expression (the WITH clause). We then insert rows into transaction_test and exclude rows in training_transaction_ids. For more on this technique check out how to implement complete processes using WITH and RETURNING.

This is the result:

db=# SELECT * FROM transaction_training;
 id │      billed_at      │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
  6 │ 2021-05-21 00:00:00 │             20 │ t
  4 │ 2021-07-04 00:00:00 │             64 │ t
  5 │ 2021-02-27 00:00:00 │             90 │ t
  2 │ 2021-05-31 00:00:00 │             63 │ f
 10 │ 2021-05-05 00:00:00 │             38 │ f
  3 │ 2021-11-11 00:00:00 │             26 │ t
  9 │ 2021-05-07 00:00:00 │             36 │ f
  7 │ 2021-07-29 00:00:00 │             69 │ t
(8 rows)

db=# SELECT * FROM transaction_test;
 id │      billed_at      │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
  1 │ 2021-05-22 00:00:00 │             54 │ t
  8 │ 2021-02-24 00:00:00 │             20 │ f
(2 rows)

And there you have it, a training dataset and a test dataset with SQL, directly in the database!


Descriptive Statistics

When you get a fresh data set, the first thing you usually want to do is get familiar with it. Some people call this "Exploratory data analysis", or EDA for short. Pandas, as well as other languages and tools, provide some utility functions to produce descriptive statistics.

Describing a Series

Describing a numeric series using pandas:

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

To generate descriptive statistics in SQL, you can use the following query:

WITH s AS (
    SELECT * FROM (VALUES (1), (2), (3)) AS t(n)
)
SELECT
    count(*),
    avg(n),
    stddev(n),
    min(n),
    percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n),
    max(n)
FROM
    s;

count │   avg  │   stddev  │ min │ percentile_cont │ max
──────┼────────┼───────────┼─────┼─────────────────┼─────
    3 │ 2.0000 │ 1.0000000 │   1 │ {1.5,2,2.5}     │   3

Basic aggregate functions in SQL produced a similar output to that of Pandas. The interesting part here is function percentile_cont.

The function percentile_cont is an Ordered-Set Aggregate Function, meaning, it operates with respect to some order. To illustrate, in the query above you can replace both min and max with percentile_cont:

WITH s AS (
    SELECT * FROM (VALUES (1), (2), (3)) AS t(n)
)
SELECT
    percentile_cont(array[
        0,                  -- <--- min
        0.25, 0.5, 0.75,
        1                   -- <--- max
    ]) WITHIN GROUP (ORDER BY n),
FROM
    s;

 percentile_cont
─────────────────
 {1,1.5,2,2.5,3}

Another common use for percentile_cont is to find the median of a sequence of numbers:

WITH s AS (SELECT * FROM generate_series(1, 10) AS t(n))
SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY n),
    percentile_cont(0.5) WITHIN GROUP (ORDER BY n)
FROM
    s;

 percentile_disc │ percentile_cont
─────────────────┼─────────────────
               5 │             5.5

The query demonstrates two types of medians:

  • percentile_disc returns the value that 50% of the table is less than. Notice that 5 is present in the table.
  • percentile_cont returns a value that 50% of the values are less than. The value 5.5 is not present in the table, it's the value between 5 and 6 which divides the values in the table in the middle (0.5).

Both functions can accept an array of values, in which case they will return a corresponding list of results.

Describing a Categorical Series

Previously we described a list of numbers. This time we want to describe a list of categorical values. For example, pandas will produce the following output:

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Using SQL we can produce similar output:

WITH s AS (SELECT unnest(array['a', 'a', 'b', 'c']) AS v)
SELECT
    count(*),
    count(DISTINCT V) AS unique,
    mode() WITHIN GROUP (ORDER BY V) AS top
FROM
    s;

 count │ unique │ top
───────┼────────┼─────
     4 │      3 │ a

To calculate the number of unique values you used COUNT(DISTINCT ...). To get the value that appears most often in the series, i.e. the one with the highest frequency, you used another ordered set function called mode.


Subtotals

Another useful technique to analyze data is producing sub totals. We already saw how to apply aggregate functions on the table, but how about multiple aggregation levels in the same query?

Let's imagine a table of employees. For each employee we keep the name, the role and the department they work at:

WITH emp AS (
    SELECT * FROM (VALUES
        ('Haki',    'R&D',      'Manager'),
        ('Dan',     'R&D',      'Developer'),
        ('Jax',     'R&D',      'Developer'),
        ('George',  'Sales',    'Manager'),
        ('Bill',    'Sales',    'Developer'),
        ('David',   'Sales',    'Developer')
    ) AS t(
        name,       department,  role
    )
)
SELECT * FROM emp;

  name  │ department │   role
────────┼────────────┼───────────
 Haki   │ R&D        │ Manager
 Dan    │ R&D        │ Developer
 Jax    │ R&D        │ Developer
 George │ Sales      │ Manager
 Bill   │ Sales      │ Developer
 David  │ Sales      │ Developer

To find the number of employees with each role in each departments, you can use GROUP BY:

WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY department, role;

 department │   role    │ count
────────────┼───────────┼───────
 R&D        │ Developer │     2
 R&D        │ Manager   │     1
 Sales      │ Manager   │     1
 Sales      │ Developer │     2

Rollup

What if we want to also get the number of employees in each department, in all roles:

WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY ROLLUP(department), role;

 department │   role    │ count
────────────┼───────────┼───────
 R&D        │ Developer │     2
 R&D        │ Manager   │     1
 Sales      │ Manager   │     1
 Sales      │ Developer │     2
 R&D        │ ¤         │     3     -- <-- Total for R&D
 Sales      │ ¤         │     3     -- <-- Total for Sales

Notice the use of the subclause ROLLUP in the GROUP BY clause.

To add a subtotal for each department, we tell the database to "rollup" by the department field. The database then added two additional aggregate results, one for each department.

The database can actually produce sub totals in several levels. For example, to add the grand total of the number of employees in all departments, we can tell the database to "rollup" the role field as well:

WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY ROLLUP(department, role);

 department │   role    │ count
────────────┼───────────┼───────
 ¤          │ ¤         │     6    -- <-- Grand total
 R&D        │ Developer │     2
 R&D        │ Manager   │     1
 Sales      │ Manager   │     1
 Sales      │ Developer │     2
 R&D        │ ¤         │     3     -- <-- Total for R&D
 Sales      │ ¤         │     3     -- <-- Total for Sales

The query now includes several subtotals. To identify the aggregate level for each row, use the function GROUPING:

WITH emp AS ( /* ... */ )
SELECT
    department, role, COUNT(*),
    GROUPING(department)        AS department_subtotal,
    GROUPING(department, role)  AS grand_total
FROM emp
GROUP BY ROLLUP(department), role;

 department │   role    │ count │ department_subtotal │ grand_total
────────────┼───────────┼───────┼─────────────────────┼─────────────
 Sales      │ Developer │     2 │                   0 │           0
 Sales      │ Manager   │     1 │                   0 │           0
 R&D        │ Developer │     2 │                   0 │           0
 R&D        │ Manager   │     1 │                   0 │           0
 ¤          │ Manager   │     2 │                   1 │           2
 ¤          │ Developer │     4 │                   1 │           2

Cube

When talking about subtotals, or aggregates at multiple levels, OLAP usually comes to mind. OLAP cube is a technique where all the subtotals are pre-calculated to make retrieval faster. Using ROLLUP we can achieve this by providing all possible combinations, but there is an easier way to do that:

WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY CUBE(department, role);

 department │   role    │ count
────────────┼───────────┼───────
 ¤          │ ¤         │     6     -- <-- Grand Total
 R&D        │ Developer │     2
 R&D        │ Manager   │     1
 Sales      │ Manager   │     1
 Sales      │ Developer │     2
 R&D        │ ¤         │     3     -- <-- Subtotal for R&D department
 Sales      │ ¤         │     3     -- <-- Subtotal for Sales department
 ¤          │ Manager   │     2     -- <-- Subtotal for Manager role
 ¤          │ Developer │     4     -- <-- Subtotal for Developer role

CUBE generates subtotals for all possible combinations. In the examples above, using CUBE added an additional subtotal for each department.

Grouping Sets

Both CUBE and ROLLUP are syntactic sugar of GROUPING SETS:

WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY GROUPING SETS (
    (),                     -- <-- Grand total
    (role),                 -- <-- Subtotal by role
    (department),           -- <-- Subtotal by department
    (role, department)      -- <-- No aggregation, the row itself
);

 department │   role    │ count
────────────┼───────────┼───────
 ¤          │ ¤         │     6
 Sales      │ Developer │     2
 Sales      │ Manager   │     1
 R&D        │ Developer │     2
 R&D        │ Manager   │     1
 ¤          │ Manager   │     2
 ¤          │ Developer │     4
 R&D        │ ¤         │     3
 Sales      │ ¤         │     3

Using GROUPING SETS you can tell the database exactly which subtotals to generate. The query above is generating all possible combinations, so it's equivalent to CUBE.


Pivot Tables

Pivot tables are a technique to reshape data, and pandas includes a very powerful pivot_table function:

>>> import pandas as pd
>>> df = pd.DataFrame({
...   'name':       ['Haki', 'Dan', 'Jax', 'George', 'Bill', 'David'],
...   'department': ['R&D', 'R&D', 'R&D', 'Sales', 'Sales', 'Sales',],
...   'role':       ['Manager', 'Developer', 'Developer', 'Manager', 'Developer', 'Developer'],
... })
>>> pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')

department  R&D  Sales
role
Developer     2      2
Manager       1      1

Conditional Expressions

To recreate the "pivot" above in SQL, do the following:

WITH emp AS (
    SELECT * FROM (VALUES
        ('Haki',    'R&D',      'Manager'),
        ('Dan',     'R&D',      'Developer'),
        ('Jax',     'R&D',      'Developer'),
        ('George',  'Sales',    'Manager'),
        ('Bill',    'Sales',    'Developer'),
        ('David',   'Sales',    'Developer')
    ) AS t(
        name,       department,  role
    )
)
SELECT
    role,
    SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D",
    SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales"
FROM
    emp
GROUP BY
    role;

   role    │ R&D │ Sales
───────────┼─────┼───────
 Manager   │   1 │     1
 Developer │   2 │     2

Using CASE, you constructed a "Conditional Expression" that returns the value 1 to sum only for a specific department. By adding a conditional expression for every department, you "reshaped" the data to a pivot table.

Pivot Table

Aggregate Expressions

Using CASE is flexible but it's a bit tedious. Applying conditions on aggregates is so useful that SQL added special syntax for it:

WITH emp AS ( /* ... */ )
SELECT
    role,
    COUNT(*) FILTER (WHERE department = 'R&D') as "R&D",
    COUNT(*) FILTER (WHERE department = 'Sales') as "Sales"
FROM
    emp
GROUP BY
    role;

This way of reshaping data into "pivot tables" is very common for visualizing and analyzing data. There is no doubt Pandas is much more flexible and comfortable when it comes to pivot tables. However, the process of turning rows into columns is also very common in ETL processes, where data is denormalized.

The bottom line is that Pandas may be better to quickly analyze and visualize small sets of data, but ETL processes may benefit from doing this process in the database using conditional or aggregate expressions.


Running and Cumulative Aggregation

Aggregations over a sliding window are very common, usually on a time series. For example, traders use moving averages as an indication of a stock's trend, running sums can be used to backtest an anomaly detection strategy and so on.

To illustrate, take this table listing daily temperatures:

WITH temperatures AS (
    SELECT * FROM (VALUES
        ('2021-01-01'::date, 10),
        ('2021-01-02'::date, 12),
        ('2021-01-03'::date, 13),
        ('2021-01-04'::date, 14),
        ('2021-01-05'::date, 18),
        ('2021-01-06'::date, 15),
        ('2021-01-07'::date, 16),
        ('2021-01-08'::date, 17)
    ) as t(t, c)
)
SELECT * FROM temperatures;

     t      │ c
────────────┼────
 2021-01-01 │ 10
 2021-01-02 │ 12
 2021-01-03 │ 13
 2021-01-04 │ 14
 2021-01-05 │ 18
 2021-01-06 │ 15
 2021-01-07 │ 16
 2021-01-08 │ 17

Window Functions

Say you want to compare each day to the hottest day ever:

WITH temperatures AS ( /* ... */ )
SELECT
    *,
    MAX(c) OVER (PARTITION BY 1) AS hottest_temperature
FROM
    temperatures;

     t      │ c  │ hottest_temperature
────────────┼────┼─────────────────────
 2021-01-01 │ 10 │                  18
 2021-01-02 │ 12 │                  18
 2021-01-03 │ 13 │                  18
 2021-01-04 │ 14 │                  18
 2021-01-05 │ 18 │                  18
 2021-01-06 │ 15 │                  18
 2021-01-07 │ 16 │                  18
 2021-01-08 │ 17 │                  18

By adding the OVER (PARTITION ...) clause to the aggregate function MAX, you turned it into a window function. Window functions operate on a set of rows determined by the PARTITION clause. Since you used a constant value PARTITION BY 1, the function operates on all the rows.

To complete the query, use the result of the window function in an expression:

WITH temperatures AS ( /* ... */ )
SELECT
    *,
    (c::float / MAX(c) OVER (PARTITION BY 1) - 1) * 100 AS compared_to_hottest_day
FROM
    temperatures;

     t      │ c  │ compared_to_hottest_day
────────────┼────┼────────────────────────
 2021-01-01 │ 10 │     -44.44444444444444
 2021-01-02 │ 12 │    -33.333333333333336
 2021-01-03 │ 13 │     -27.77777777777778
 2021-01-04 │ 14 │     -22.22222222222222
 2021-01-05 │ 18 │                      0
 2021-01-06 │ 15 │    -16.666666666666664
 2021-01-07 │ 16 │    -11.111111111111116
 2021-01-08 │ 17 │     -5.555555555555558

If you're are not sure what is the purpose of casting the temperature to float, make sure to read my tip about dividing integers in SQL.

Sliding Window

Comparing each day's temperature against the hottest temperature ever can be useful, but more often than not, you want to compare a value to a limited period, or in other words, a sliding window.

Sliding window

To find the highest temperature in the last three days for example, you can add a frame clause:

WITH temperatures AS ( /* ... */ )
SELECT
    *,
    MAX(c) OVER (
        ORDER BY t
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS hottest_temperature_last_three_days
FROM
    temperatures;

     t      │ c  │ hottest_temperature_last_three_days
────────────┼────┼─────────────────────────────────────
 2021-01-01 │ 10 │                                  10
 2021-01-02 │ 12 │                                  12
 2021-01-03 │ 13 │                                  13
 2021-01-04 │ 14 │                                  14
 2021-01-05 │ 18 │                                  18
 2021-01-06 │ 15 │                                  18
 2021-01-07 │ 16 │                                  18
 2021-01-08 │ 17 │                                  17

Once again you used a window function, but this time you added a frame clause to it, stating the window should include 2 previous rows and the current one.

The frame syntax is very flexible, and it is not restricted to ROWS. The query above can be expressed using a range frame as well:

WITH temperatures AS ( /* ... */ )
SELECT
    *,
    MAX(c) OVER (
        ORDER BY t
        RANGE BETWEEN '2 days' PRECEDING AND '0 days' FOLLOWING
    ) AS hottest_temperature_last_three_days
FROM
    temperatures;

     t      │ c  │ hottest_temperature_last_three_days
────────────┼────┼─────────────────────────────────────
 2021-01-01 │ 10 │                                  10
 2021-01-02 │ 12 │                                  12
 2021-01-03 │ 13 │                                  13
 2021-01-04 │ 14 │                                  14
 2021-01-05 │ 18 │                                  18
 2021-01-06 │ 15 │                                  18
 2021-01-07 │ 16 │                                  18
 2021-01-08 │ 17 │                                  17

Notice how nice the RANGE syntax is... it reads like an actual sentence!


Linear Regression

Another common tool for analyzing data is linear regression.

Linear Regression

For example, performing linear regression using Pandas and Scipy:

>>> import pandas as pd
>>> import scipy.stats
>>> df = pd.DataFrame([[1.2, 1], [2, 1.8], [3.1, 2.9]])
>>> slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(df[0], df[1])
(1.0 -0.2000000000000004 1.0 9.003163161571059e-11 0.0)

Most developers probably don't expect the database to have statistical functions, but PostgreSQL does:

WITH t AS (SELECT * FROM (VALUES
    (1.2, 1.0),
    (2.0, 1.8),
    (3.1, 2.9)
) AS t(x, y))
SELECT
    regr_slope(y, x) AS slope,
    regr_intercept(y, x) AS intercept,
    sqrt(regr_r2(y, x)) AS r
FROM
    t;

       slope        │      intercept       │ r
────────────────────┼──────────────────────┼───
 1.0000000000000002 │ -0.20000000000000048 │ 1

Using statistical aggregate functions in PostgreSQL we got results similar to scipy.


Interpolation

Data cleaning is an important part of any data job, and handling missing values is a big part of that.

Fillna, forward fill and backward fill

Fill with Constant

The simplest way to fill in missing data is with some constant value. Using Pandas for example, this is done using the fillna function:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G'])
>>> df.fillna('X')

   0
0  A
1  B
2  X
3  D
4  X
5  X
6  G

In SQL, if your missing values are NULL, you can use a condition expression CASE, or use the shorter COALESCE function:

WITH tb AS (
    SELECT * FROM (VALUES
        (1, 'A' ),
        (2, 'B' ),
        (3, null),
        (4, 'D' ),
        (5, null),
        (6, null),
        (7, 'G' )
    ) AS t(n, v)
)
SELECT
    n,
    coalesce(v, 'X') AS v
FROM
    tb;

 n │ v
───┼───
 1 │ A
 2 │ B
 3 │ X
 4 │ D
 5 │ X
 6 │ X
 7 │ G

The function COALESCE accepts any number of arguments and return the first one that is not NULL.

Back and Forward Fill

Filling values with constants is easy, but not always possible. Another common interpolation technique is filling empty values with previous or following non-missing values.

Pandas offers several variations on back and forward filling, for example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G'])
>>> df.fillna(method='ffill') # or df.ffill()
>>>    0
0  A
1  B
2  B
3  D
4  D
5  D
6  G

>>> df.fillna(method='bfill') # or df.bfill() or df.backfill()
   0
0  A
1  B
2  D
3  D
4  G
5  G
6  G

To achieve the same using SQL, you can use a subquery:

WITH tb AS (
    SELECT * FROM (VALUES
        (1, 'A' ),
        (2, 'B' ),
        (3, null),
        (4, 'D' ),
        (5, null),
        (6, null),
        (7, 'G' )
    ) AS t(n, v)
)
SELECT
    *,
    -- Find the next not null value
    coalesce(v, (
        SELECT v
        FROM tb AS tb_
        WHERE tb_.n < tb.n AND v IS NOT NULL
        ORDER BY n DESC
        LIMIT 1
    )) AS ffill_v,

    -- Find the previous not null value
    coalesce(v, (
        SELECT v
        FROM tb as tb_
        WHERE tb_.n > tb.n AND v IS NOT NULL
        ORDER BY n ASC
        LIMIT 1
    )) as bfill_v
FROM
    tb;

 n │ v │ ffill_v │ bfill_v
───┼───┼─────────┼─────────
 1 │ A │ A       │ A
 2 │ B │ B       │ B
 3 │ ¤ │ B       │ D
 4 │ D │ D       │ D
 5 │ ¤ │ D       │ G
 6 │ ¤ │ D       │ G
 7 │ G │ G       │ G

The SQL version is a bit longer, but it is fairly expressive, and it gives great flexibility.

NOTE: It's tempting to use the window function LEAD and LAG here, but these function can only be used when filling single row gaps. Once you have more than one consecutive missing row, LEAD and LAG may leave you with missing values.

Linear Interpolation

Another common interpolation technique for discrete data is linear interpolation.

>>> import pandas as pd
... import numpy as np
...
... df = pd.DataFrame([
...     (np.datetime64('2021-01-01'), 10),
...     (np.datetime64('2021-01-02'), 12),
...     (np.datetime64('2021-01-03'), np.NaN),
...     (np.datetime64('2021-01-04'), 14),
...     (np.datetime64('2021-01-05'), np.NaN),
...     (np.datetime64('2021-01-06'), np.NaN),
...     (np.datetime64('2021-01-07'), 18),
...     (np.datetime64('2021-01-08'), 15)
... ], columns=('t', 'c'))

>>> # Assume data is evenly distributed
>>> df['c'].interpolate('linear')
0    10.000000
1    12.000000
2    13.000000
3    14.000000
4    15.333333
5    16.666667
6    18.000000
7    15.000000

Linear interpolation works by filling missing values along a linear line between two known coordinates (x1, y1) and (x2, y2). In this case, The two known coordinates are the last and the next known date and temperatures.

You already found the next and previous known value for each row when you implemented back and forward fill:

WITH

temperatures AS (
    SELECT * FROM (VALUES
        ('2021-01-01'::date, 10),
        ('2021-01-02'::date, 12),
        ('2021-01-03'::date, null),
        ('2021-01-04'::date, 14),
        ('2021-01-05'::date, null),
        ('2021-01-06'::date, null),
        ('2021-01-07'::date, 18),
        ('2021-01-08'::date, 15)
    ) as t(t, c)
),

temperatures_with_previous_values AS (
    SELECT
        *,

        -- Last known temperature
        (
            SELECT
                array[extract('epoch' FROM t), c]
            FROM
                temperatures as temperatures_
            WHERE
                temperatures_.t < temperatures.t
                AND c IS NOT NULL
            ORDER BY
                temperatures_.t DESC
            LIMIT 1
        ) AS last_known_temperature,

        -- Next known temperature
        (
            SELECT
                array[extract('epoch' FROM t), c]
            FROM
                temperatures as temperatures_
            WHERE
                temperatures_.t > temperatures.t
                AND c IS NOT NULL
            ORDER BY
                temperatures_.t ASC
            LIMIT 1
        ) AS next_known_temperature
    FROM
        temperatures
)

SELECT * FROM temperatures_with_previous_values;

     t      │ c  │ last_known_temperature │ next_known_temperature
────────────┼────┼────────────────────────┼────────────────────────
 2021-01-01 │ 10 │ ¤                      │ {1609545600,12}
 2021-01-02 │ 12 │ {1609459200,10}        │ {1609718400,14}
 2021-01-03 │  ¤ │ {1609545600,12}        │ {1609718400,14}
 2021-01-04 │ 14 │ {1609545600,12}        │ {1609977600,18}
 2021-01-05 │  ¤ │ {1609718400,14}        │ {1609977600,18}
 2021-01-06 │  ¤ │ {1609718400,14}        │ {1609977600,18}
 2021-01-07 │ 18 │ {1609718400,14}        │ {1610064000,15}
 2021-01-08 │ 15 │ {1609977600,18}        │ ¤

There are two main differences here from what you've done before:

  1. You converted the date to a number: This is called "epoch", the number of seconds since 1970. To convert the date you used the function extract('epoch' FROM t).

  2. You keep two values from the previous and next row: To implement linear interpolation we need coordinates, which are both the date and the temperature. To return multiple values from the previous row, you constructed an array array[extract('epoch' FROM t), c].

To calculate a missing value with two known coordinates using linear interpolation, use the following formula:

y = y0 + (x - x0) * ((y1 - y0) / (x1 - x0))

You already have all the data available, so just organize it a bit:

WITH
temperatures AS ( /* ... */ ),

temperatures_with_previous_values AS ( /* ... */ ),

-- This step is just for convenience
temperatures_prep AS (
    SELECT
        t,
        c,
        extract('epoch' from t) as x,
        last_known_temperature[1] as x0,
        last_known_temperature[2] as y0,
        next_known_temperature[1] as x1,
        next_known_temperature[2] as y1
    FROM
        temperatures_with_previous_values
)
SELECT t, c, x, x0, y0, x1, y1 FROM temperatures_prep;

     t      │ c  │     x      │     x0     │ y0 │     x1     │ y1
────────────┼────┼────────────┼────────────┼────┼────────────┼────
 2021-01-01 │ 10 │ 1609459200 │          ¤ │  ¤ │ 1609545600 │ 12
 2021-01-02 │ 12 │ 1609545600 │ 1609459200 │ 10 │ 1609718400 │ 14
 2021-01-03 │  ¤ │ 1609632000 │ 1609545600 │ 12 │ 1609718400 │ 14
 2021-01-04 │ 14 │ 1609718400 │ 1609545600 │ 12 │ 1609977600 │ 18
 2021-01-05 │  ¤ │ 1609804800 │ 1609718400 │ 14 │ 1609977600 │ 18
 2021-01-06 │  ¤ │ 1609891200 │ 1609718400 │ 14 │ 1609977600 │ 18
 2021-01-07 │ 18 │ 1609977600 │ 1609718400 │ 14 │ 1610064000 │ 15
 2021-01-08 │ 15 │ 1610064000 │ 1609977600 │ 18 │          ¤ │  ¤

Now that you have all the data neatly organized, you can use the formula to calculate missing values:

WITH

temperatures AS ( /* ... */ ),

temperatures_with_previous_values AS ( /* ... */ ),

temperatures_prep AS ( /* ... */ )

SELECT
    t,
    c,
    CASE
        WHEN c IS NOT NULL THEN c
        ELSE y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
    END AS interpolated_c
FROM
    temperatures_prep
;

     t      │ c  │   interpolated_c
────────────┼────┼────────────────────
 2021-01-01 │ 10 │                 10
 2021-01-02 │ 12 │                 12
 2021-01-03 │  ¤ │                 13
 2021-01-04 │ 14 │                 14
 2021-01-05 │  ¤ │ 15.333333333333334
 2021-01-06 │  ¤ │ 16.666666666666668
 2021-01-07 │ 18 │                 18
 2021-01-08 │ 15 │                 15

And there it is, the missing temperatures were filled using linear interpolation.

The complete query

WITH

temperatures AS (
    SELECT * FROM (VALUES
        ('2021-01-01'::date, 10),
        ('2021-01-02'::date, 12),
        ('2021-01-03'::date, null),
        ('2021-01-04'::date, 14),
        ('2021-01-05'::date, null),
        ('2021-01-06'::date, null),
        ('2021-01-07'::date, 18),
        ('2021-01-08'::date, 15)
    ) as t(t, c)
),

temperatures_with_previous_values AS (
    SELECT
        *,

        -- Last known temperature
        (
            SELECT
                ARRAY[EXTRACT('epoch' FROM t), c]
            FROM
                temperatures as temperatures_
            WHERE
                temperatures_.t < temperatures.t
                AND c IS NOT NULL
            ORDER BY
                temperatures_.t DESC
            LIMIT 1
        ) AS last_known_temperature,

        -- Next known temperature
        (
            SELECT
                ARRAY[EXTRACT('epoch' FROM t), c]
            FROM
                temperatures as temperatures_
            WHERE
                temperatures_.t > temperatures.t
                AND c IS NOT NULL
            ORDER BY
                temperatures_.t ASC
            LIMIT 1
        ) AS next_known_temperature
    FROM
        temperatures
),

-- This step is just for convenience
temperatures_prep AS (
    SELECT
        t,
        c,
        extract('epoch' from t) as x,
        last_known_temperature[1] as x0,
        last_known_temperature[2] as y0,
        next_known_temperature[1] as x1,
        next_known_temperature[2] as y1
    FROM
        temperatures_with_previous_values
)

SELECT
    t,
    c,
    CASE
        WHEN c IS NOT NULL THEN c
        ELSE y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
    END AS interpolated_c
FROM
    temperatures_prep;

Binning

Binning, or "bucketing", is a technique to group values together.

Binning

Custom Binning

Custom binning is most common for categorical data or for discrete data when ranges are pre-determined.

Image you have a table with student grades, and you want to classify them to letter grades A-F:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
)
SELECT
    CASE
        WHEN grade < 60 THEN 'F'
        WHEN grade < 70 THEN 'D'
        WHEN grade < 80 THEN 'C'
        WHEN grade < 90 THEN 'B'
        ELSE 'A'
    END AS letter_grade,
    COUNT(*)
FROM
    grades
GROUP BY
    letter_grade
ORDER BY
    letter_grade;

 letter_grade │ count
──────────────┼───────
 A            │    29
 B            │    10
 C            │    12
 D            │    10
 F            │    39

Custom binning can also use expressions to categorize data into custom groups. In the american grade system for example, the letter grade can also be calculated based on the percentile, and not the absolute grade:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
),

percent_grades AS (
    SELECT percent_rank() OVER (ORDER BY grade) as percent_grade
    FROM grades
)

SELECT
    CASE
        WHEN percent_grade < 0.6 THEN 'F'
        WHEN percent_grade < 0.7 THEN 'D'
        WHEN percent_grade < 0.8 THEN 'C'
        WHEN percent_grade < 0.9 THEN 'B'
        ELSE 'A'
    END AS letter_grade,
    COUNT(*)
FROM
    percent_grades
GROUP BY
    letter_grade
ORDER BY
    letter_grade;

To find the relative grade of every student based on the grades of all other students, you used the window function percent_rank. The function returns a value between 0 and 1 that represents the rank of the current row relative to all other rows.

Custom binning is mostly useful for when the data is familiar, or within a known set of values. When exploring unknown or unbound sets of data there are other binning techniques you can use.

Equal Height Binning

Say you need to divide students to groups based on their grade, and you want every group to have roughly the same number of students. To achieve this, PostgreSQL provides a function called NTILE:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
),

grades_with_tiles AS (
    SELECT
        *,
        ntile(10) OVER (ORDER BY grade) AS bucket
    FROM
        grades
)
SELECT
    min(grade) AS from_grade,
    max(grade) AS to_grade,
    count(*) AS cnt,
    bucket
FROM
    grades_with_tiles
GROUP BY
    bucket
ORDER BY
    from_grade;

 from_grade │ to_grade │ cnt │ bucket
────────────┼──────────┼─────┼────────
         40 │       41 │  10 │      1
         41 │       45 │  10 │      2
         47 │       53 │  10 │      3
         53 │       61 │  10 │      4
         61 │       70 │  10 │      5
         71 │       79 │  10 │      6
         79 │       87 │  10 │      7
         89 │       95 │  10 │      8
         95 │       99 │  10 │      9
         99 │      100 │  10 │     10

Divding values into bins or buckets with roughly the same frequency is called "Equal Height Binning". Notice how each group holds exactly 10 rows.

The function NTILE is a window function. It accepts the number of buckets, in this case 10, and an order by clause in which to divide the range by. Window functions can't be used as a group by key, so you need to use either a subquery or a CTE to add the "bucket" field.

Equal Width Binning

So far you divided students to groups based on arbitrary letter grades (custom binning) and to equally sizes groups based on their grades (equal width binning). None of these grouping technics gives you a good sense of the data distribution. One way to visualize the data and get a sense of how grades are distributed is using a histogram.

Histogram

To draw a histogram you need to divide grades into equal width ranges. Grades range from 0 to 100, so you can split the range to 10 bars of 10 each:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
)
SELECT
    floor((grade - 1) / 10) as bucket,
    min(grade) from_grade,
    max(grade) to_grade,
    count(*)
FROM
    grades
GROUP BY
    bucket
ORDER BY
    bucket;

 bucket │ from_grade │ to_grade │ count
────────┼────────────┼──────────┼───────
      3 │         40 │       40 │     7
      4 │         41 │       50 │    20
      5 │         51 │       58 │    12
      6 │         61 │       70 │    11
      7 │         71 │       79 │    11
      8 │         81 │       90 │    14
      9 │         92 │      100 │    25

To assign each grade to the right bucket we used a little arithmetics. This worked out nicely because the arithmetics here are fairly simple, but what if you wanted smaller buckets? Say 20 buckets of width 5? or 25 buckets of width 4? That would have made the calculation more complicated.

To simplify the task of assigning values into equal width buckets within a predefined range, PostgreSQL provides the function width_bucket:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
)
SELECT
    width_bucket(grade, 0 ,101, 20) as bucket,
    (width_bucket(grade, 0 ,101, 20) - 1) * 5 as low_bound,
    width_bucket(grade, 0 ,101, 20) * 5 as high_bound,
    count(*)
FROM
    grades
GROUP BY
    bucket
ORDER BY
    bucket;

bucket │ low_bound │ high_bound │ count
───────┼───────────┼────────────┼───────
     8 │        35 │         40 │     7
     9 │        40 │         45 │    13
    10 │        45 │         50 │     7
    11 │        50 │         55 │     8
    12 │        55 │         60 │     4
    13 │        60 │         65 │     7
    14 │        65 │         70 │     4
    15 │        70 │         75 │     7
    16 │        75 │         80 │     4
    17 │        80 │         85 │     6
    18 │        85 │         90 │     8
    19 │        90 │         95 │     7
    20 │        95 │        100 │    18

Ths function width_bucket accepts the value to assign, two arguments for the lower and higher bounds of the range, and the number of buckets to divide the range into.

You should be careful with the values you set for the higher and lower bounds. The higher bound of the range is exclusive, meaning, if you set the higher bound to 100, grades that equal 100 will be considered out of range, and will result in an additional bucket. This is why the query above uses 101 as the higher bound.

To calculate the higher bound of each bucket, we multiply the index of the bucket by the width. To get the lower bound, we multiple the width by the index minus one, or in other words, the higher bound of the previous bucket.

Histograms are great for visualization, but if you try to draw a histogram from the result above you won't be able to get a real sense of the distribution because you might have gaps. Notice for example, how the range above starts with bucket 9, which is not the first bucket. This is because in our grades table, no one got a grade which is less than 40.

The function width_bucket is useful, but we already do most of the hard work, so might as well generate the buckets on our own using generate_series:

SELECT
    bucket,
    (bucket - 1) * 5 + 1 AS low_bound,
    bucket * 5 AS high_bound
FROM
    generate_series(1, 20) AS bucket;

bucket │ low_bound │ high_bound
───────┼───────────┼────────────
     1 │         1 │          5
     2 │         6 │         10
     3 │        11 │         15
     4 │        16 │         20
     5 │        21 │         25
     6 │        26 │         30
     7 │        31 │         35
     8 │        36 │         40
     9 │        41 │         45
    10 │        46 │         50
    11 │        51 │         55
    12 │        56 │         60
    13 │        61 │         65
    14 │        66 │         70
    15 │        71 │         75
    16 │        76 │         80
    17 │        81 │         85
    18 │        86 │         90
    19 │        91 │         95
    20 │        96 │        100

The query generates 20 ranges of 5 within a range 1 to 100. To create the histogram, use this table as an axis and join it to the grades:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
),

buckets AS (
    SELECT
        bucket,
        (bucket - 1) * 5 + 1 AS low_bound,
        bucket * 5 AS high_bound
    FROM
        generate_series(1, 20) AS bucket
)

SELECT
    bucket,
    low_bound,
    high_bound,
    COUNT(grade) AS cnt
FROM
    buckets
    LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound
GROUP BY
    bucket, low_bound, high_bound
ORDER BY
    bucket;

bucket │ low_bound │ high_bound │ cnt
───────┼───────────┼────────────┼─────
     1 │         1 │          5 │   0
     2 │         6 │         10 │   0
     3 │        11 │         15 │   0
     4 │        16 │         20 │   0
     5 │        21 │         25 │   0
     6 │        26 │         30 │   0
     7 │        31 │         35 │   0
     8 │        36 │         40 │   7
     9 │        41 │         45 │  13
    10 │        46 │         50 │   7
    11 │        51 │         55 │   8
    12 │        56 │         60 │   4
    13 │        61 │         65 │   7
    14 │        66 │         70 │   4
    15 │        71 │         75 │   7
    16 │        76 │         80 │   4
    17 │        81 │         85 │   6
    18 │        86 │         90 │   8
    19 │        91 │         95 │   7
    20 │        96 │        100 │  18

To make sure you don't have any gaps in the data, you LEFT JOINed the grades to the generated axis table buckets. As a result, some rows do not have a value. Using COUNT(*) count rows, so buckets with no grades return 1. To overcome that, count only rows with grades using COUNT(grade).

To finish off with a bang, you can enhance your query with a little ascii chart to display the histogram straight in the terminal:

WITH grades AS (
    SELECT round(70 + sin(n) * 30)::int AS grade
    FROM generate_series(1, 100) AS n
),

buckets AS (
    SELECT
        bucket,
        (bucket - 1) * 5 + 1 AS low_bound,
        bucket * 5 AS high_bound
    FROM
        generate_series(1, 20) AS bucket
)

SELECT
    low_bound || ' - ' || high_bound as bounds,
    COUNT(grade) AS cnt,
    repeat('■', COUNT(grade)::int) as chart
FROM
    buckets
    LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound
GROUP BY
    bucket, low_bound, high_bound
ORDER BY
    bucket;

 bounds  │ cnt │       chart
─────────┼─────┼────────────────────
1 - 5    │   0 │
6 - 10   │   0 │
11 - 15  │   0 │
16 - 20  │   0 │
21 - 25  │   0 │
26 - 30  │   0 │
31 - 35  │   0 │
36 - 40  │   7 │ ■■■■■■■
41 - 45  │  13 │ ■■■■■■■■■■■■■
46 - 50  │   7 │ ■■■■■■■
51 - 55  │   8 │ ■■■■■■■■
56 - 60  │   4 │ ■■■■
61 - 65  │   7 │ ■■■■■■■
66 - 70  │   4 │ ■■■■
71 - 75  │   7 │ ■■■■■■■
76 - 80  │   4 │ ■■■■
81 - 85  │   6 │ ■■■■■■
86 - 90  │   8 │ ■■■■■■■■
91 - 95  │   7 │ ■■■■■■■
96 - 100 │  18 │ ■■■■■■■■■■■■■■■■■■

And there you have it! A histogram with SQL, right there in your terminal...


Take Away

If there is one thing you take away from this article, it should be this - use the best tool for the job!. Pandas is great, and SQL is also great. Each has its strengths and weaknesses, and you have a better chance of creating an optimal data pipeline if know both!




Similar articles