Optimizing Queries in PostgreSQL: Partitioning and Indexing

An in-depth look at practical query optimization techniques in PostgreSQL, focusing on partitioning and indexing to improve query performance.

In this article, we will explore practical query optimization techniques in PostgreSQL using a real-world dataset of stock prices.

There are two sides to query optimization: the database side and the user side. We will focus on the database side, specifically on partitioning and indexing, and demonstrate their impact on query performance.

For this article, we will use a stock price dataset from the GitHub repository FNSPID_Financial_News_Dataset. This dataset contains daily price data for various stocks from 2009 through 2023.

Schema of the stock_prices Table

Here is the schema of the stock_prices table that we are working on:

CREATE TABLE stock_prices (
    stock_id VARCHAR,
    date DATE,
    open NUMERIC,
    high NUMERIC,
    low NUMERIC,
    close NUMERIC,
    adj_close NUMERIC,
    volume NUMERIC,
    sentiment_gpt NUMERIC,
    news_flag NUMERIC,
    scaled_sentiment NUMERIC
);
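If you want to follow along, the dataset's CSV can be loaded with psql's \copy command. This is a sketch: the filename below is illustrative, and it assumes the CSV's columns match the table schema above.

```sql
-- Load the dataset into stock_prices (adjust the path to your local copy).
\copy stock_prices FROM 'stock_prices.csv' WITH (FORMAT csv, HEADER true)
```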

Partitioning

Partitioning involves dividing a large table into smaller, more manageable pieces called partitions. This can improve query performance by allowing the database engine to scan only the relevant partitions.

Before Partitioning

Let’s run a query to fetch data for a specific date range and analyze its performance without partitioning.

EXPLAIN ANALYZE
SELECT * FROM stock_prices WHERE date >= '2023-01-01';

                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on stock_prices  (cost=0.00..5875.21 rows=11644 width=96) (actual time=0.051..17.036 rows=11772 loops=1)
   Filter: (date >= '2023-01-01'::date)
   Rows Removed by Filter: 116165
 Planning Time: 0.157 ms
 Execution Time: 17.522 ms
(5 rows)

Implementing Partitioning

The results clearly demonstrate the inefficiency of scanning the entire table for queries targeting specific date ranges. Let’s explore how partitioning can dramatically improve performance.

First, let’s create a table named stock_prices_partitioned to enable partitioning:

CREATE TABLE stock_prices_partitioned (
    stock_id VARCHAR,
    date DATE,
    open NUMERIC,
    high NUMERIC,
    low NUMERIC,
    close NUMERIC,
    adj_close NUMERIC,
    volume NUMERIC,
    sentiment_gpt NUMERIC,
    news_flag NUMERIC,
    scaled_sentiment NUMERIC
) PARTITION BY RANGE (date);

Use the script below to dynamically create partitions for each year.

#!/bin/bash
# Create one yearly partition covering [year-01-01, next_year-01-01).
DATABASE="stock_data"
USER="sami"

for year in {2009..2023}; do
    next_year=$((year + 1))
    psql -U "$USER" -d "$DATABASE" -c "
    CREATE TABLE stock_prices_${year} PARTITION OF stock_prices_partitioned
    FOR VALUES FROM ('$year-01-01') TO ('$next_year-01-01');"
done
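One caveat with range partitioning: an INSERT fails if a row's date falls outside every declared range. On PostgreSQL 11 and later you can add a DEFAULT partition as a catch-all; the table name below is illustrative and this step is optional for this dataset.

```sql
-- Catch-all for rows outside the 2009-2023 yearly ranges (PostgreSQL 11+).
CREATE TABLE stock_prices_default
    PARTITION OF stock_prices_partitioned DEFAULT;
```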

Migrating Data to Partitioned Table

We will migrate the data from the original table to the partitioned table and verify the data count to ensure the migration was successful.

INSERT INTO stock_prices_partitioned
SELECT * FROM stock_prices;
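To verify the migration, we can compare row counts between the original and partitioned tables; the two numbers should match exactly. This is a minimal check, not a full data comparison.

```sql
-- Both counts should be identical after the INSERT ... SELECT above.
SELECT
    (SELECT COUNT(*) FROM stock_prices)             AS original_rows,
    (SELECT COUNT(*) FROM stock_prices_partitioned) AS partitioned_rows;
```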

Check Partition Distribution

To verify the distribution of data across partitions, we will run a query to count the number of rows in each partition. This helps us understand how the data is distributed and ensures that the partitions are being used correctly.

SELECT tableoid::regclass AS partition_name, COUNT(*) 
FROM stock_prices_partitioned 
GROUP BY tableoid;

  partition_name   | count 
-------------------+-------
 stock_prices_2009 |   970
 stock_prices_2010 |  6184
 stock_prices_2011 |  6552
 stock_prices_2012 |  6675
 stock_prices_2013 |  7524
 stock_prices_2014 |  7633
 stock_prices_2015 |  8469
 stock_prices_2016 |  9075
 stock_prices_2017 |  9542
 stock_prices_2018 |  9888
 stock_prices_2019 | 10700
 stock_prices_2020 | 10753
 stock_prices_2021 | 10680
 stock_prices_2022 | 11520
 stock_prices_2023 | 11772
(15 rows)

After Partitioning

Now run the same query against the partitioned table and analyze its performance:

stock_data=# EXPLAIN ANALYZE
SELECT * FROM stock_prices_partitioned WHERE date >= '2023-01-01';
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on stock_prices_2023 stock_prices_partitioned  (cost=0.00..343.15 rows=11771 width=97) (actual time=0.057..2.595 rows=11772 loops=1)
   Filter: (date >= '2023-01-01'::date)
 Planning Time: 0.862 ms
 Execution Time: 3.200 ms
(4 rows)

Partitioning significantly reduced execution time, from 17.522 ms to 3.200 ms, an 81.7% reduction. Note in the plan above that only the stock_prices_2023 partition was scanned; the planner pruned the other fourteen partitions entirely.
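As a quick sanity check on that figure, the relative reduction follows directly from the two EXPLAIN ANALYZE timings:

```python
# Relative reduction in execution time, from the EXPLAIN ANALYZE output above.
before_ms = 17.522  # sequential scan over the full table
after_ms = 3.200    # scan confined to the stock_prices_2023 partition

reduction_pct = (before_ms - after_ms) / before_ms * 100
print(f"{reduction_pct:.1f}% reduction")  # 81.7% reduction
```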

Indexing

Indexing allows the database engine to locate rows quickly without scanning the entire table. The stock_prices table has no indexes yet (the schema above declares no primary key), so every lookup by ticker requires a full scan. Let’s create an index on the stock_id column to optimize queries that filter by stock.

Before Indexing

Let’s run a query to fetch data for a specific stock and analyze its performance without an index.

stock_data=# EXPLAIN ANALYZE
SELECT * FROM stock_prices WHERE stock_id = 'AAPL';
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Seq Scan on stock_prices  (cost=0.00..5875.21 rows=409 width=96) (actual time=0.040..23.852 rows=387 loops=1)
   Filter: ((stock_id)::text = 'AAPL'::text)
   Rows Removed by Filter: 127550
 Planning Time: 0.692 ms
 Execution Time: 23.907 ms
(5 rows)

Adding an Index

The reason for the above query’s inefficiency is that it performs a full table scan, which is a costly operation. To address this, let’s create an index on the stock_id column to optimize the query.

CREATE INDEX idx_stock_id ON stock_prices(stock_id);
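After creating the index, you can confirm the planner is actually using it via the cumulative statistics view pg_stat_user_indexes, whose idx_scan counter increments each time an index is used in a scan:

```sql
-- idx_scan should grow after running indexed queries against stock_prices.
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'stock_prices';
```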

After Indexing

Run the same query again and analyze its performance with the index.

stock_data=# EXPLAIN ANALYZE                                     
SELECT * FROM stock_prices WHERE stock_id = 'AAPL';
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_stock_id on stock_prices  (cost=0.29..28.45 rows=409 width=96) (actual time=0.052..0.117 rows=387 loops=1)
   Index Cond: ((stock_id)::text = 'AAPL'::text)
 Planning Time: 0.539 ms
 Execution Time: 0.164 ms
(4 rows)

Using an index significantly improves query performance, reducing execution time from 23.907 ms (sequential scan) to 0.164 ms (index scan), achieving a 99.31% reduction in execution time.
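If your workload frequently filters on both ticker and date, a single composite index can serve both predicates in one scan. This is a sketch rather than something benchmarked in this article, and the index name is illustrative:

```sql
-- Serves queries such as: WHERE stock_id = 'AAPL' AND date >= '2023-01-01'
CREATE INDEX idx_stock_id_date ON stock_prices (stock_id, date);
```

Column order matters here: placing stock_id first keeps the index useful for queries that filter on stock_id alone.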

Conclusion

In this article, we explored practical query optimization techniques in PostgreSQL using a real-world dataset of stock prices. We focused on partitioning and indexing and demonstrated their impact on query performance. By applying these strategies, you can significantly improve query speed and reduce query cost.

This article was inspired by the dataset from the GitHub repository FNSPID_Financial_News_Dataset.

Happy querying!

This post is licensed under CC BY 4.0 by the author.