Lesson

Debugging duplicated data in SQL query results

What to do when your query returns data with duplicated rows or columns.

Where is your data being duplicated?

Duplicated rows

Before you start, make sure you know the schemas of your source tables or nested queries.

  1. Are you missing a GROUP BY clause?
  2. Check if your source tables or nested queries have duplicated rows. You’ll need to repeat steps 3 and 4 for every table or query result that contains duplicate rows.

    -- If the row_count is greater than 1,
    -- you have duplicated rows in your results.
    
     SELECT
         < your_columns >,
         COUNT(*) AS row_count
     FROM
         < your_table_or_upstream_query >
     GROUP BY
         < your_columns >
     ORDER BY
         row_count DESC;
    
  3. Check your table below to see how your join type interacts with your table relationships.
  4. Change your join type or reduce your table relationships.

Explanation

Rows can get duplicated by accident when data gets refreshed in upstream systems or ETL jobs.

Some tables have rows that look like duplicates at a glance. This is common with tables that track state changes (e.g., an order status table that adds a row every time the status changes). State tables may have have rows that look exactly the same, except for the timestamp of the row. It can be difficult to detect if you have tables with a lot of columns, so be sure to run through Step 2 above or ask your database admin if you’re unsure.

If you’ve written your joins assuming a one-to-one relationship for tables that actually have a one-to-many or many-to-many relationship, you’ll get duplicated rows for each match in the “many” table.

Further reading

Join types and table relationships

This table summarizes how join types interact with table relationships to produce duplicates when matching rows are found.

+----------------+------------------------+-----------------------------------+------------------------------+
|                | A is one-to-one with B | A is one-to-many with B           | A is many-to-many with B     |
+----------------+------------------------+-----------------------------------+------------------------------+
| A INNER JOIN B | No duplicate rows.     | No duplicate rows.                | Duplicated rows from A or B. |
+----------------+------------------------+-----------------------------------+------------------------------+
| A LEFT JOIN B  | No duplicate rows.     | Possible duplicates from table B. | Duplicated rows from A or B. |
+----------------+------------------------+-----------------------------------+------------------------------+
| B LEFT JOIN A  | No duplicate rows.     | Possible duplicates from table B. | Duplicated rows from A or B. |
+----------------+------------------------+-----------------------------------+------------------------------+
| A OUTER JOIN B | No duplicate rows.     | Possible duplicates from table B. | Duplicated rows from A or B. |
+----------------+------------------------+-----------------------------------+------------------------------+
| A FULL JOIN B  | No duplicate rows.     | Duplicate rows from table B.      | Duplicated rows from A or B. |
+----------------+------------------------+-----------------------------------+------------------------------+

How to reduce table relationships

If you have duplicated rows because you’re assuming a one-to-one relationship when you actually have tables that are one-to-many or many-to-many, you can remove the duplicates using:

For example:

-- Assume table_a is a one-to-many with table_b.

-- The query below will duplicate rows from table_b
-- for every matching row in table_a.

SELECT
    < your_columns >
FROM
    table_a
    LEFT JOIN table_b ON key_a = key_b;

Option 1: Use an INNER JOIN with a one-to-many relationship.

-- The query below will get one row from table_b
-- for every matching row in table_a.

SELECT
    < your_columns >
FROM
    table_a
    INNER JOIN table_b ON key_a = key_b;

Option 2: Use a CTE to reduce the table relationship.

-- The query below will get aggregated values from table_b
-- for every matching row in table_a.

WITH table_b_reduced AS (
    SELECT
        AGGREGATE_FUNCTION (< your_columns >)
    FROM
        table_b_reduced
    GROUP BY
        < your_columns >
)
SELECT
    < your_columns >
FROM
    table_a
    JOIN table_b_reduced ON key_a = key_b_reduced;

Duplicated columns

  1. If you are joining data, check if your SELECT statement is including both primary key and foreign key columns.
  2. Check if your columns are duplicated at the source by following the steps under Debugging SQL logic.
  3. Learn more about common reasons for unexpected query results.

Do you have a different problem?

Are you still stuck?

Search or ask the Metabase community.

Thanks for your feedback!

Get articles like this one in your inbox every month