How to speed up queries

Tips and tricks for improving query performance: indexes, materialized views, and more.

Let us first EXPLAIN

Every data warehouse has a query planner. This query planner figures out the best way of running the query based on your data and some pre-sets. For example, the query planner will decide whether it should read all the rows in the data warehouse or narrow its search to a specific set of rows based on a certain condition.

Let’s create some fake data in a Postgres data warehouse (other databases might work, though you may have to adjust the code). Run the following statements (you’ll need write permissions on the data warehouse).

DROP TABLE IF EXISTS data_table;
CREATE TABLE data_table (
    id SERIAL PRIMARY KEY,
    date_field DATE NOT NULL,
    category TEXT NOT NULL CHECK (category IN ('A', 'B', 'C')),
    numeric_value NUMERIC NOT NULL
);

INSERT INTO data_table (date_field, category, numeric_value)
SELECT
    CURRENT_DATE - (random() * 365)::INT, -- Random date within the last year
    CASE FLOOR(random() * 3)
        WHEN 0 THEN 'A'
        WHEN 1 THEN 'B'
        ELSE 'C'
    END, -- Random category
    (random() * 1000)::NUMERIC -- Random numeric value between 0 and 1000
FROM generate_series(1, 50000000);

This query might take a while. To check if it worked, run:

SELECT
  *
FROM
  data_table
LIMIT
  10;

You’ll get something like:

| id  | date_field        | category | numeric_value |
| --- | ----------------- | -------- | ------------- |
| 1   | August 2, 2024    | A        | 874.17        |
| 2   | October 31, 2024  | A        | 762.03        |
| 3   | August 23, 2024   | A        | 718.73        |
| 4   | February 6, 2025  | C        | 334.45        |
| 5   | August 28, 2024   | A        | 59.4          |
| 6   | August 8, 2024    | A        | 972.74        |
| 7   | October 18, 2024  | B        | 296.99        |
| 8   | November 27, 2024 | B        | 858.26        |
| 9   | September 5, 2024 | C        | 137.84        |
| 10  | February 24, 2025 | C        | 701.68        |

Let’s do a simple sum of numeric_value by category and see how long it takes. Here’s the query:

SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
GROUP BY
  category
ORDER BY
  category ASC;

Running that query on PostgreSQL 17 with 2 CPU cores and 2GB of RAM took about 8 seconds.

Now let’s run EXPLAIN on the query. EXPLAIN will output the plan for the query, along with its costs and other data about the query.

EXPLAIN
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
GROUP BY
  category
ORDER BY
  category ASC;

You’ll see the following output (which can vary depending on your current deployment):

QUERY PLAN
Finalize GroupAggregate  (cost=631972.08..631972.87 rows=3 width=34)
  Group Key: category
  ->  Gather Merge  (cost=631972.08..631972.78 rows=6 width=34)
        Workers Planned: 2
        ->  Sort  (cost=630972.06..630972.07 rows=3 width=34)
              Sort Key: category
              ->  Partial HashAggregate  (cost=630972.00..630972.04 rows=3 width=34)
                    Group Key: category
                    ->  Parallel Seq Scan on data_table  (cost=0.00..526805.33 rows=20833333 width=13)
JIT:
  Functions: 7
  Options: Inlining true, Optimization true, Expressions true, Deforming true

This output tells us:

  • The query processor will parallelize this query on the engine’s available 2 cores (Workers Planned: 2).
  • The most costly operation was the Parallel Seq Scan (cost=0.00..526805.33). Seq Scan means the engine has to read the entire table.
  • The other operations had negligible impact on overall query time.

So how can we make this query stop obliterating the database when people check their dashboard each morning? Well, since the query has to scan every row to compute the sum, you can’t really speed it up without faster disks or more CPU cores.

What we can do is try to get more value out of each run by caching results or materializing views.

Create materialized views for expensive joins, CTEs, and views

Materialized views are pre-computed queries that you can save to the disk in your data warehouse. Materialized views take the strain off your database by executing long-running queries on a certain cadence, and storing those results for people to query.

Let’s create a materialized view with the query that we did before:

CREATE MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category AS
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
GROUP BY
  category
ORDER BY
  category ASC;

Now all we need to do to get the sum for each category is to select from this materialized view:

SELECT
  *
FROM
  preprocessed_sum_of_numeric_value_by_category;

Querying the materialized view could get the time down from seconds to milliseconds.

If your data_table continues to grow, you’ll need to update your materialized view. To recompute the results, you’ll need a script to run REFRESH MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category on a certain cadence. How often you refresh the results depends on 1) how often the underlying data changes and 2) how often people check the results.
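
The refresh itself is a single statement; schedule it with cron, your orchestration tool, or whatever job runner you already use:

-- Recompute and store the latest results
REFRESH MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category;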

Here are some situations where you should consider materializing views:

Joins

Table joins are one of the most beautiful things in the relational database world. Joins let you organize your data into modular tables that you can combine with a single keyword. But joins can also be pretty dangerous: a join against a mammoth table can really bog down your data warehouse. If you need to join more than two or three tables every time you run a query, consider materializing the base query with those joins into a table.
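
As a sketch (the orders and customers tables and their columns here are hypothetical), that could look like:

CREATE MATERIALIZED VIEW orders_with_customers AS
SELECT
  o.id,
  o.order_date,
  o.total,
  c.name AS customer_name,
  c.region
FROM
  orders AS o
  JOIN customers AS c ON c.id = o.customer_id;

Dashboards can then read from orders_with_customers instead of re-running the join on every load.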

CTEs

CTEs are great for modularizing your code. But your data warehouse will have to compute these CTEs every single time you run these queries. If you find yourself running the same queries full of CTEs every day, it’s time you model your data in such a way that your queries are just a simple SELECT * FROM table where “table” is a materialized view or a table you created with the results of these CTEs.
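
For example, a CTE-heavy query against our fake data_table could be folded into a materialized view (the view name here is just for illustration):

CREATE MATERIALIZED VIEW daily_category_totals AS
WITH daily_totals AS (
  SELECT
    date_field,
    category,
    SUM(numeric_value) AS total
  FROM
    data_table
  GROUP BY
    date_field,
    category
)
SELECT
  *
FROM
  daily_totals;

From then on, the daily report is just SELECT * FROM daily_category_totals.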

Views (not materialized views)

Instead of writing a hundred lines of SQL each time you run a query, you simply select the rows from a view you saved and voilà: you get the same results. Views are just abstractions; they don’t give any performance benefits. Use views when you want to reduce the complexity of queries. If you’re using the results often, consider materializing the view.
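
For example, you could save the earlier aggregation as a plain view (the name is just for illustration); selecting from it is convenient, but the underlying query still runs in full every time:

CREATE VIEW sum_of_numeric_value_by_category AS
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
GROUP BY
  category;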

Index frequently queried columns

You might have heard about indexes at some point. Your queries are slow, a person knowledgeable in databases tells you to create an index, and boom: queries are faster now. An index is like a lookup table that helps your database quickly find specific rows without having to scan through the entire table. Just like how a book’s index helps you find topics without reading every page, a database index points directly to where data is stored. They work well when you need to look up some of the data, but not all of the data.

Let’s go back to the previous example: the SUM(numeric_value). Would an index make that query faster? Definitely not, as that query still has to go over the whole table to calculate the result.

Now, let’s imagine that someone wants to do a query that only looks up part of a table: the sum of numeric value by category, but only for the last 60 days. Let’s try it:

SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
WHERE
  date_field <= CURRENT_DATE + INTERVAL '1' DAY
  AND date_field >= CURRENT_DATE - INTERVAL '60' DAY
GROUP BY
  category
ORDER BY
  category ASC;

That took a while! Let’s run the EXPLAIN to see what’s going on here:

QUERY PLAN
Finalize GroupAggregate  (cost=857409.42..857410.21 rows=3 width=34)
  Group Key: category
  ->  Gather Merge  (cost=857409.42..857410.12 rows=6 width=34)
        Workers Planned: 2
        ->  Sort  (cost=856409.40..856409.41 rows=3 width=34)
              Sort Key: category
              ->  Partial HashAggregate  (cost=856409.34..856409.38 rows=3 width=34)
                    Group Key: category
                    ->  Parallel Seq Scan on data_table  (cost=0.00..839305.33 rows=3420801 width=13)
                          Filter: ((date_field <= (CURRENT_DATE + '1 day'::interval day)) AND (date_field >= (CURRENT_DATE - '60 days'::interval day)))
JIT:
  Functions: 9
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Once again, we see the costly sequential scan operation. Would an index on the date_field help?

Creating an index

Let’s create an index on the date_field column:

CREATE INDEX idx_data_table_date
ON data_table (date_field);

The database needs some time to create an index, and by default it will block writes to the table until the index is built.
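
If you can’t afford to block writes while the index builds, Postgres can also build it concurrently; it’s slower and can’t run inside a transaction block, but writes keep flowing:

-- Alternative: build the index without blocking writes (Postgres-specific)
CREATE INDEX CONCURRENTLY idx_data_table_date
ON data_table (date_field);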

You can check the index with a query:

SELECT
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
WHERE tablename = 'data_table';

Writing a query that takes advantage of an index

Once the index is up, let’s run the query again to see if it uses the index and cuts down on the time it takes to return results.
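
Here’s the same relative-date query again, this time with EXPLAIN in front:

EXPLAIN
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
WHERE
  date_field <= CURRENT_DATE + INTERVAL '1' DAY
  AND date_field >= CURRENT_DATE - INTERVAL '60' DAY
GROUP BY
  category
ORDER BY
  category ASC;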

You’ll see that the index wasn’t even used: you still get the parallel sequential scan (and the query is still slow).

Let’s see if we can get the database to use the index with specific, not relative dates.

EXPLAIN
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
WHERE
  date_field BETWEEN DATE '2025-01-01' AND DATE '2025-03-01'
GROUP BY
  category
ORDER BY
  category ASC;

Which gives us:

QUERY PLAN
Finalize GroupAggregate  (cost=649104.46..649105.24 rows=3 width=34)
  Group Key: category
  ->  Gather Merge  (cost=649104.46..649105.16 rows=6 width=34)
        Workers Planned: 2
        ->  Sort  (cost=648104.44..648104.44 rows=3 width=34)
              Sort Key: category
              ->  Partial HashAggregate  (cost=648104.38..648104.41 rows=3 width=34)
                    Group Key: category
                    ->  Parallel Seq Scan on data_table  (cost=0.00..630972.00 rows=3426475 width=13)
                          Filter: ((date_field >= '2025-01-01'::date) AND (date_field <= '2025-03-01'::date))
JIT:
  Functions: 9
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Using specific dates cut the cost from cost=0.00..839305.33 to cost=0.00..630972.00.

But isn’t the engine still doing a full table scan? Yes, it is (Parallel Seq Scan on data_table). To get the database to use the index, we’ll need to narrow the time span. Let’s change the BETWEEN condition to cover just two weeks, say 2025-03-01 to 2025-03-15:
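
EXPLAIN
SELECT
  category AS "Category",
  SUM(numeric_value) AS "Sum"
FROM
  data_table
WHERE
  date_field BETWEEN DATE '2025-03-01' AND DATE '2025-03-15'
GROUP BY
  category
ORDER BY
  category ASC;

Run that, and you’ll see something like: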

QUERY PLAN
Finalize GroupAggregate  (cost=633296.94..633297.72 rows=3 width=34)
  Group Key: category
  ->  Gather Merge  (cost=633296.94..633297.64 rows=6 width=34)
        Workers Planned: 2
        ->  Sort  (cost=632296.92..632296.92 rows=3 width=34)
              Sort Key: category
              ->  Partial HashAggregate  (cost=632296.86..632296.89 rows=3 width=34)
                    Group Key: category
                    ->  Parallel Bitmap Heap Scan on data_table  (cost=27900.15..628034.34 rows=852503 width=13)
                          Recheck Cond: ((date_field >= '2025-03-01'::date) AND (date_field <= '2025-03-15'::date))
                          ->  Bitmap Index Scan on idx_data_table_date  (cost=0.00..27388.65 rows=2046008 width=0)
                                Index Cond: ((date_field >= '2025-03-01'::date) AND (date_field <= '2025-03-15'::date))
JIT:
  Functions: 9
  Options: Inlining true, Optimization true, Expressions true, Deforming true

See the Index Cond? Now we’re taking advantage of the index we created. What was previously a sequential scan is now an index scan.

Seems magical, right? Well… not so much. Because as you’ve already seen, if you query for a long enough time span, you’ll still trigger a full table scan. And there’s a reason for that.

Let’s imagine the following scenario: you have a book with a thousand pages, and you’re an avid reader who gets through a hundred pages a day. If you want to find a few specific lines in that book, an index won’t help; you’d need to scan the entire book. But if you need to find the chapter that covers a topic, that’s a case where an index comes in handy.

Wait, but these are computers, right? Like, incredibly fast readers? Can’t I just make an index for every column so it covers all the possible cases? Well, you definitely could, but all those indexes will hinder the table’s write performance: every time a row is inserted or updated, the engine has to update all the indexes too. Plus, the database will need more and more space to store those indexes.

So indexes can help (a lot), but they have their limitations. Here are some other ways to speed up queries.

Other performance considerations

Only ask for the data you need

More rows mean slower queries (most of the time). It’s like a textbook: the more pages it has, the longer it takes to read. Databases work the same way, so only ask for the data you need. A common way to cut down on rows is to filter for smaller date ranges.
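
For example, instead of SELECT * over the whole table, pull just the columns and dates you actually need:

SELECT
  category,
  numeric_value
FROM
  data_table
WHERE
  date_field >= CURRENT_DATE - INTERVAL '7' DAY;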

Caching results

Caching is simply saving the results of the query so you can retrieve the results in the future. You decide how long you want the results to remain valid before they need to be refreshed by another call to the database. Check your BI tool for caching options.

Model your data to meet your needs now, not forever

By data modeling here we just mean how you organize your data into tables, and how those tables relate to each other. Your analytical needs will differ from your operational needs (your app’s data model is likely not the best model for analytical queries). Well-modeled data can really improve performance, but there’s no perfect model. A good data model is one that solves the problems you have today (and for the next few months). Companies that stay alive will grow and evolve over time, so the design choices that you made in the past may no longer suit you in the future.

Check out workload management features for your database

Databases are designed to run queries for multiple clients at the same time. There’s a catch though: to a database engine, a massive query and a quick one have the same priority, so you can have a situation where the massive query will eat up all the server’s resources while the quick query patiently waits for resources to free up.

Some database engines provide workload management features that allow you to assign queries to different queues, which can soften the impact of those big queries hammering your server.

Avoid running queries on your app’s production database

Unless you’re a startup with only a few people using your app, we don’t recommend running queries on the same database that your app uses in production. An expensive query might cause mayhem in your daily operations. We recommend that you at least create a separate data warehouse that’s read-only (often called a read replica, as it’s a copy of your production database) and connect your analytics tool to it.

For data lakes, use partition keys

If you’re using a data lake, the same basic principles outlined here apply. Plus, if your data lake has partition functionality, you should use partition keys to help you avoid massive reads. To make the most of partitioning, include the partition keys in all your queries.
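
As a sketch, assuming a hypothetical events table partitioned by event_date, filtering on the partition key lets the engine skip every partition outside the range instead of reading them all:

SELECT
  COUNT(*) AS events_last_week
FROM
  events
WHERE
  event_date BETWEEN DATE '2025-03-01' AND DATE '2025-03-07';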

Consider a columnar data warehouse

Traditional relational data warehouses like Postgres are great all-rounders, and they can take you quite far along the analytical path. But if query times become intolerable—and you’ve already tried all the techniques in this article—consider moving your analytical workloads to a columnar engine. Columnar data warehouses are designed for analytical workloads, so they can give you a big performance jump (for a price, of course).

For performance tips for columnar storage, you’ll need to refer to the docs on the specific columnar engine, as each engine has its own quirks and pricing.

You can’t make all your queries blazing fast

What you can do is strike a balance between:

  • Infrastructure costs: Even if you throw more cores and memory at the problem, your query speed will plateau.
  • The indexes you create: If you have too many indexes, you’ll slow down writes.
  • The staleness of the data: If you don’t need data updated in real time, you can save and reuse results.