How to speed up queries
Tips and tricks for improving query performance: indexes, materialized views, and more.
Use EXPLAIN to see how a query is executed
In SQL, you specify what data you want to get, but you don’t specify how you want to get it. Every data warehouse has a query planner whose job is to figure out how to execute the query. This query planner figures out the best way of running the query based on execution time, resources used, and/or database presets. For example, the query planner will decide whether it should read all the rows in the table or narrow its search to a specific set of rows based on a certain condition.
To see how the database will execute your query, you can prepend your query with EXPLAIN. The query planner will list the operations it will carry out, in which order, using which indexes, etc. There’s also EXPLAIN ANALYZE, which gives you actual query execution statistics. But to get those statistics, the database will actually execute the query, which can be time-consuming or even undesirable for queries that modify the data.
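For example, here’s a minimal sketch of the difference (some_table and its id column are placeholders, not tables from this article):
-- Shows the estimated plan only; the query itself never runs
EXPLAIN
SELECT * FROM some_table WHERE id = 42;
-- Runs the query and reports actual timings and row counts
EXPLAIN ANALYZE
SELECT * FROM some_table WHERE id = 42;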
To see EXPLAIN in action, let’s first create some fake data in a PostgreSQL database (other databases might work, though you may have to adjust the code). Run the following statements (you’ll need write permissions for the data warehouse).
DROP TABLE IF EXISTS data_table;
CREATE TABLE data_table (
id SERIAL PRIMARY KEY,
date_field DATE NOT NULL,
category TEXT NOT NULL CHECK (category IN ('A', 'B', 'C')),
numeric_value NUMERIC NOT NULL
);
INSERT INTO data_table (date_field, category, numeric_value)
SELECT
CURRENT_DATE - (random() * 365)::INT, -- Random date within the last year
CASE FLOOR(random() * 3)
WHEN 0 THEN 'A'
WHEN 1 THEN 'B'
ELSE 'C'
END, -- Random category
(random() * 1000)::NUMERIC -- Random numeric value between 0 and 1000
FROM generate_series(1, 50000000);
This query might take a while. To check if it worked, run:
SELECT
*
FROM
data_table
LIMIT
10;
You’ll get something like:
| id | date_field | category | numeric_value |
| --- | ----------------- | -------- | ------------- |
| 1 | August 2, 2024 | A | 874.17 |
| 2 | October 31, 2024 | A | 762.03 |
| 3 | August 23, 2024 | A | 718.73 |
| 4 | February 6, 2025 | C | 334.45 |
| 5 | August 28, 2024 | A | 59.4 |
| 6 | August 8, 2024 | A | 972.74 |
| 7 | October 18, 2024 | B | 296.99 |
| 8 | November 27, 2024 | B | 858.26 |
| 9 | September 5, 2024 | C | 137.84 |
| 10 | February 24, 2025 | C | 701.68 |
Let’s do a simple sum of numeric_value by category and see how long it takes. Here’s the query:
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
GROUP BY
category
ORDER BY
category ASC;
You’ll get something like:
| Category | Sum |
| -------- | ---------------- |
| A | 8,336,275,140.07 |
| B | 8,330,139,598.5 |
| C | 8,334,188,258.65 |
Running that query on our PostgreSQL 17 instance with 2 CPU cores and 2GB of RAM took about 8 seconds - a pretty long time.
Now let’s run EXPLAIN on the query. EXPLAIN will output the plan for the query, along with its costs and other data about the query. Just prepend the query with EXPLAIN:
EXPLAIN
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
GROUP BY
category
ORDER BY
category ASC;
You’ll see the following output (which can vary depending on your current deployment):
QUERY PLAN
Finalize GroupAggregate (cost=631972.08..631972.87 rows=3 width=34)
Group Key: category
-> Gather Merge (cost=631972.08..631972.78 rows=6 width=34)
Workers Planned: 2
-> Sort (cost=630972.06..630972.07 rows=3 width=34)
Sort Key: category
-> Partial HashAggregate (cost=630972.00..630972.04 rows=3 width=34)
Group Key: category
-> Parallel Seq Scan on data_table (cost=0.00..526805.33 rows=20833333 width=13)
JIT:
Functions: 7
Options: Inlining true, Optimization true, Expressions true, Deforming true
EXPLAIN’s output should be read bottom to top. This query plan says:
- The database will first scan all records in the table sequentially (Seq Scan), in parallel on the available two cores (Workers Planned: 2).
- Then, the database will aggregate (HashAggregate) the data, and do it on each worker separately (Partial). The database has to aggregate because we included a GROUP BY clause and the SUM aggregation. The aggregation can be done in parallel because summing is an operation that can be parallelized: summing the total sums over two partitions is the same as just summing over all the data.
- Then, the database will sort the aggregated results on each worker separately (Sort). The Sort is there, surprisingly, not because we included an ORDER BY clause - if you remove the ORDER BY clause, the Sort step will still be there. The database sorts the results of the aggregation because it needs to merge the results from the two workers, and it’s more efficient to do that when the results are sorted. If instead of ORDER BY category you had ORDER BY "Sum", there would be two sorts in the query plan, with an additional sort after the GroupAggregate.
- Next, the database will merge the results from the two workers (Gather Merge). At this point, the database hasn’t yet added up the sums for each category. We know it hasn’t because after aggregation the rows value is 3: each worker has three rows, one for each category (A, B, and C), but after Gather Merge we have 6 rows, so the database has yet to combine the sums from each worker.
- Finally, the database will group and aggregate the results from the two workers (GroupAggregate).
Performance-wise, this output tells us:
- The query processor makes use of the 2 cores (Workers Planned: 2).
- The most costly operation was the Parallel Seq Scan (cost=0.00..526805.33).
- The other operations had a negligible impact on overall query time (e.g., 630972.00..630972.04 for HashAggregate - the difference between the two numbers, i.e., the cost, is very small compared to the cost of the Parallel Seq Scan).
So how can we make this query stop obliterating the database when people check their dashboard each morning? The database has to read every row in the table to sum them, so we can’t reduce the total number of rows scanned. The best we can do is to speed up the scan. Our options include:
- Increasing the number of CPU cores, so that each core has to go through fewer rows.
- Switching to a database with an architecture optimized for faster sequential scans of single columns (like a columnar database).
If infrastructure changes are out of the question, you could instead precompute the aggregation results - useful if people run the query often - by caching results or creating materialized views.
Create materialized views for expensive joins, CTEs, and views
Materialized views are pre-computed queries that you can save in your data warehouse. Materialized views take the strain off your database by storing computed results for people to query, instead of computing the rows each time they query the database. You can periodically refresh materialized views to keep their results up to date with new data.
Let’s create a materialized view with the query that we did before:
CREATE MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category AS
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
GROUP BY
category
ORDER BY
category ASC;
This will save the results of the query:
-- preprocessed_sum_of_numeric_value_by_category
| Category | Sum |
| -------- | ---------------- |
| A | 8,336,275,140.07 |
| B | 8,330,139,598.5 |
| C | 8,334,188,258.65 |
Now all we need to do to get the sum for each category is to select from this materialized view:
SELECT
*
FROM
preprocessed_sum_of_numeric_value_by_category;
Querying the materialized view could get the time down from seconds to milliseconds.
Depending on your database, materialized views might or might not get updated when the underlying data changes. For example, if you’re using PostgreSQL, you’ll need to refresh your materialized view yourself. To recompute the results, you’ll need a script that runs REFRESH MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category on a certain cadence. How often you refresh the results depends on how often the underlying data changes and how often people check the results.
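As a minimal sketch, the refresh itself is one statement; and if your PostgreSQL instance happens to have the pg_cron extension installed (an assumption - an external cron job or your orchestration tool works just as well), you can schedule it from SQL:
-- Recompute the stored results (the view is locked while this runs)
REFRESH MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category;
-- Assumes pg_cron is available; 'refresh-category-sums' is a made-up job name
SELECT cron.schedule(
'refresh-category-sums',
'0 * * * *', -- every hour, on the hour
'REFRESH MATERIALIZED VIEW preprocessed_sum_of_numeric_value_by_category'
);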
Here are some situations where you should consider materializing views:
Joins
Table joins are one of the most beautiful things in the relational database world (or at least we think so). Joins allow you to organize your business in modular sections that can be put together with just a simple keyword. But joins can also be pretty dangerous: a join against a mammoth table can really bog down your data warehouse. If you need to join more than two or three tables each time you make a query, consider materializing the base query with the joins into a materialized view.
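For example, here’s a sketch of that pattern with hypothetical orders and customers tables (the table and column names are made up for illustration):
-- Materialize the expensive join once...
CREATE MATERIALIZED VIEW orders_with_customers AS
SELECT
o.id AS order_id,
o.order_date,
o.total,
c.name AS customer_name
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id;
-- ...so day-to-day queries read precomputed rows instead of redoing the join
SELECT
customer_name,
SUM(total) AS revenue
FROM orders_with_customers
GROUP BY customer_name;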
CTEs
CTEs are great for modularizing your code. But your data warehouse will have to compute these CTEs every single time you run these queries. If you find yourself running the same queries full of CTEs every day, it’s time to model your data in such a way that your queries become a simple SELECT * FROM table, where table is a materialized view or a table you created with the results of these CTEs.
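Here’s a sketch of that pattern using the data_table from earlier (the daily_totals name is just illustrative):
-- Before: the CTE is recomputed on every run
WITH daily_totals AS (
SELECT date_field, SUM(numeric_value) AS total
FROM data_table
GROUP BY date_field
)
SELECT * FROM daily_totals WHERE date_field >= CURRENT_DATE - 7;
-- After: store the CTE's results once...
CREATE MATERIALIZED VIEW daily_totals AS
SELECT date_field, SUM(numeric_value) AS total
FROM data_table
GROUP BY date_field;
-- ...and the daily query becomes a simple select
SELECT * FROM daily_totals WHERE date_field >= CURRENT_DATE - 7;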
Views (not materialized views)
Instead of having to write hundreds of lines of SQL each time you do a query, you simply select the rows from a view you saved and voilà: you get the same results. Views are just abstractions; they don’t give any performance benefits. Use views when you want to reduce the complexity of queries. If you’re using the results often, consider materializing the view.
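For example, a plain view over the earlier aggregation would look like this (a sketch; the view name is illustrative). It’s convenient, but selecting from it still runs the full aggregation every time:
CREATE VIEW sum_of_numeric_value_by_category AS
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM data_table
GROUP BY category;
-- Looks cheap, but re-runs the aggregation on every call
SELECT * FROM sum_of_numeric_value_by_category;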
Index frequently queried columns
You might have heard about indexes at some point. Your queries are slow, a person knowledgeable in databases tells you to create an index, and boom: queries are faster now. An index is like a lookup table that helps your database quickly find specific rows without having to scan through the entire table. Just like how a book’s index helps you find topics without reading every page, a database index points directly to where data is stored. They work well when you need to look up some of the data, but not all of the data.
When you add an index to one or more columns, the database will create a copy of those columns that’s optimized for looking up specific values, and also contains a pointer to get the values from the other (non-indexed) columns.
For example, let’s say you have a table with customer data, containing customer IDs, names, sign-up dates etc. If you wanted to find a customer with a specific name, the database would have to scan every single record and compare the name to the one you requested, until it finds a match. But if you put an index on the name column, the database will have other means of doing this. Exact implementations of indexes can vary, but roughly, you can think of index as a sorted list of all the names, stored together with a pointer to the exact record with that name in the database. Because the list of names is sorted in the index, the database needs much less time to locate the name you asked for. Once the name is located, it’ll use the pointer stored in the index to retrieve the rest of the data for that customer.
Your database will always have an index on the primary key (that’s why looking up a record by ID is usually faster than, for example, first and last name).
Let’s go back to the previous example: the SUM(numeric_value) over all categories in the table. Would an index make that query faster? Definitely not, as that query still has to go over the whole table to calculate the result.
Now, let’s imagine that someone wants to do a query that only looks up part of a table: the sum of numeric value by category, but only for the last 60 days. Let’s try it:
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
WHERE
date_field <= CURRENT_DATE + INTERVAL '1' DAY
AND date_field >= CURRENT_DATE - INTERVAL '60' DAY
GROUP BY
category
ORDER BY
category ASC;
That took a while! Let’s run EXPLAIN to see what’s going on here:
QUERY PLAN
Finalize GroupAggregate (cost=857409.42..857410.21 rows=3 width=34)
Group Key: category
-> Gather Merge (cost=857409.42..857410.12 rows=6 width=34)
Workers Planned: 2
-> Sort (cost=856409.40..856409.41 rows=3 width=34)
Sort Key: category
-> Partial HashAggregate (cost=856409.34..856409.38 rows=3 width=34)
Group Key: category
-> Parallel Seq Scan on data_table (cost=0.00..839305.33 rows=3420801 width=13)
Filter: ((date_field <= (CURRENT_DATE + '1 day'::interval day)) AND (date_field >= (CURRENT_DATE - '60 days'::interval day)))
JIT:
Functions: 9
Options: Inlining true, Optimization true, Expressions true, Deforming true
Once again, we see the costly sequential scan operation. Would an index on the date_field column help?
Creating an index
Let’s create an index on the date_field column:
CREATE INDEX idx_data_table_date
ON data_table (date_field);
The database needs some time to create an index, and it’ll lock down the table from any write operations until it finishes the index.
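As an aside, if you can’t afford to block writes while the index builds, PostgreSQL can build the same index without taking that lock (a sketch; a CONCURRENTLY build takes longer and can’t run inside a transaction block):
-- Alternative to the CREATE INDEX above: writes keep flowing while the index builds
CREATE INDEX CONCURRENTLY idx_data_table_date
ON data_table (date_field);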
You can check the index with a query:
SELECT
tablename,
indexname,
indexdef
FROM
pg_indexes
WHERE tablename = 'data_table';
Writing a query that takes advantage of an index
Once the index is up, let’s run the query again to see if it uses the index and cuts down on the time it takes to return results.
If you run EXPLAIN again on the relative-date query (shown below), you’ll see that the index wasn’t even used, and you still get the parallel sequential scan (and the query is still slow).
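For reference, that’s just the previous query with EXPLAIN prepended:
EXPLAIN
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
WHERE
date_field <= CURRENT_DATE + INTERVAL '1' DAY
AND date_field >= CURRENT_DATE - INTERVAL '60' DAY
GROUP BY
category
ORDER BY
category ASC;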
Let’s see if we can get the database to use the index with specific, not relative dates.
EXPLAIN
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
WHERE
date_field BETWEEN DATE '2025-01-01' AND DATE '2025-03-01'
GROUP BY
category
ORDER BY
category ASC;
Which gives us:
QUERY PLAN
Finalize GroupAggregate (cost=649104.46..649105.24 rows=3 width=34)
Group Key: category
-> Gather Merge (cost=649104.46..649105.16 rows=6 width=34)
Workers Planned: 2
-> Sort (cost=648104.44..648104.44 rows=3 width=34)
Sort Key: category
-> Partial HashAggregate (cost=648104.38..648104.41 rows=3 width=34)
Group Key: category
-> Parallel Seq Scan on data_table (cost=0.00..630972.00 rows=3426475 width=13)
Filter: ((date_field >= '2025-01-01'::date) AND (date_field <= '2025-03-01'::date))
JIT:
Functions: 9
Options: Inlining true, Optimization true, Expressions true, Deforming true
Using specific dates cut the cost from cost=0.00..839305.33 to cost=0.00..630972.00.
But isn’t the engine still doing a full table scan? Yes, it is (Parallel Seq Scan on data_table). To get the database to use the index, we’ll need to narrow the time span. Let’s change the BETWEEN condition to a two-week window, say 2025-03-01 to 2025-03-15.
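The query is the same as before; only the BETWEEN dates change:
EXPLAIN
SELECT
category AS "Category",
SUM(numeric_value) AS "Sum"
FROM
data_table
WHERE
date_field BETWEEN DATE '2025-03-01' AND DATE '2025-03-15'
GROUP BY
category
ORDER BY
category ASC;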
Run that, and you’ll see something like:
QUERY PLAN
Finalize GroupAggregate (cost=633296.94..633297.72 rows=3 width=34)
Group Key: category
-> Gather Merge (cost=633296.94..633297.64 rows=6 width=34)
Workers Planned: 2
-> Sort (cost=632296.92..632296.92 rows=3 width=34)
Sort Key: category
-> Partial HashAggregate (cost=632296.86..632296.89 rows=3 width=34)
Group Key: category
-> Parallel Bitmap Heap Scan on data_table (cost=27900.15..628034.34 rows=852503 width=13)
Recheck Cond: ((date_field >= '2025-03-01'::date) AND (date_field <= '2025-03-15'::date))
-> Bitmap Index Scan on idx_data_table_date (cost=0.00..27388.65 rows=2046008 width=0)
Index Cond: ((date_field >= '2025-03-01'::date) AND (date_field <= '2025-03-15'::date))
JIT:
Functions: 9
Options: Inlining true, Optimization true, Expressions true, Deforming true
See the Index Cond? Now we’re taking advantage of the index we created. What was previously a sequential scan is now an index scan.
Seems magical, right? Well… not so much. Because as you’ve already seen, if you query for a long enough time span, you’ll still trigger a full table scan. And there’s a reason for that.
Let’s imagine the following scenario: you have a book with a thousand pages. You’re an avid reader who reads one hundred pages per day. Now let’s say that you want to find specific lines in a book. You can’t use an index to find those lines; you’d need to scan the entire book. Now let’s imagine that, instead of finding the specific lines, you need to find the chapter that covers a topic: that’s a case where an index comes in handy.
Wait, but these are computers, right? Like, incredibly fast readers? Can’t I just make an index for every column so it covers all the possible cases? Well, you definitely could, but all those indexes will hinder the table’s write capacity: every single time that there’s a new row, the engine will have to update all the indexes. Plus, the database will need more and more space to save those indexes.
So indexes can help (a lot), but they have their limitations. Here are some other ways to speed up queries.
Other performance considerations
Only ask for the data you need
More rows mean slower queries (most of the time). It’s similar to a textbook: the more pages it has, the longer it takes to read. Databases work the same way, so only ask for the data you need. A common way to cut down on rows is to filter for smaller date ranges.
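For example, with the data_table from earlier, select only the columns you need and filter to a narrow window (the 30-day range is just an example):
SELECT
category,
numeric_value
FROM
data_table
WHERE
date_field >= CURRENT_DATE - INTERVAL '30' DAY;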
Caching results
Caching is simply saving the results of the query so you can retrieve the results in the future. You decide how long you want the results to remain valid before they need to be refreshed by another call to the database. Check your BI tool for caching options.
Model your data to meet your needs now, not forever
By data modeling here we just mean how you organize your data into tables, and how those tables relate to each other. Your analytical needs will differ from your operational needs (your app’s data model is likely not the best model for analytical queries). Well-modeled data can really improve performance, but there’s no perfect model. A good data model is one that solves the problems you have today (and for the next few months). Companies that stay alive will grow and evolve over time, so the design choices that you made in the past may no longer suit you in the future.
Check out workload management features for your database
Databases are designed to run queries for multiple clients at the same time. There’s a catch though: to a database engine, a massive query and a quick one have the same priority, so you can have a situation where the massive query will eat up all the server’s resources while the quick query patiently waits for resources to free up.
Some database engines provide workload management features that allow you to assign queries to different queues, which can soften the impact of those big queries hammering your server.
Avoid running queries on your app’s production database
Unless you’re a startup with only a few people using your app, we don’t recommend running queries on the same database that your app uses in production. An expensive query might cause mayhem in your daily operations. We recommend that you at least create a separate data warehouse that’s read-only (often called a read replica, as it’s a copy of your production database) and connect your analytics tool to it.
For data lakes, use partition keys
If you’re using a data lake, the same basic principles outlined here apply. Plus, if your data lake has partition functionality, you should use partition keys to help you avoid massive reads. To make the most of the parallelization, you’ll want to use partition keys in all your queries.
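For example, if a hypothetical events table in your lake is partitioned by an event_date column, filtering on that partition key lets the engine skip whole partitions instead of reading everything (a sketch; exact syntax and behavior vary by engine):
-- Reads only the partitions for the requested week
SELECT COUNT(*) AS weekly_events
FROM events
WHERE event_date BETWEEN DATE '2025-03-01' AND DATE '2025-03-07';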
Consider a columnar data warehouse
Traditional relational data warehouses like PostgreSQL are great all-rounders, and they can take you quite far along the analytical path. But if query times become intolerable—and you’ve already tried all the techniques in this article—consider moving your analytical workloads to a columnar engine. Columnar data warehouses are designed for analytical workloads, so they can give you a big performance jump (for a price, of course).
For performance tips for columnar storage, you’ll need to refer to the docs on the specific columnar engine, as each engine has its own quirks and pricing.
You can’t make all your queries blazing fast
What you can do is strike a balance between:
- Infrastructure costs: Even if you throw more cores and memory at the problem, your query speed will plateau.
- The indexes you create: If you have too many indexes, you’ll slow down writes.
- The staleness of the data: If you don’t need data updated in real time, you can save and reuse results.