By gjrwebber


2009-11-17 02:25:20 8 Comments

I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

16 comments

@Shamseer PC 2013-11-21 08:55:21

CREATE TABLE test (col text);
INSERT INTO test VALUES
 ('1'),
 ('2'), ('2'),
 ('3'),
 ('4'), ('4'),
 ('5'),
 ('6'), ('6');
DELETE FROM test
 WHERE ctid in (
   SELECT t.ctid FROM (
     SELECT row_number() over (
               partition BY col
               ORDER BY col
               ) AS rnum,
            ctid FROM test
       ORDER BY col
     ) t
    WHERE t.rnum >1);

@Tobias 2018-03-13 09:28:49

I tested it, and it worked; I formatted it for readability. It looks quite sophisticated, but it could use some explanation. How would one change this example for his/her own use case?

@denplis 2017-02-18 10:07:24

DELETE FROM tablename
WHERE id IN (SELECT id
    FROM (SELECT id,ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                 FROM tablename) t
          WHERE t.rnum > 1);

Delete duplicates by column(s) and keep row with lowest id. The pattern is taken from the postgres wiki

Using CTEs you can achieve a more readable version of the above through this

WITH duplicate_ids as (
    SELECT id, rnum 
    FROM num_of_rows
    WHERE rnum > 1
),
num_of_rows as (
    SELECT id, 
        ROW_NUMBER() over (partition BY column1, 
                                        column2, 
                                        column3 ORDER BY id) AS rnum
        FROM tablename
)
DELETE FROM tablename 
WHERE id IN (SELECT id from duplicate_ids)

@naXa 2016-04-11 23:01:54

Generalized query to delete duplicates:

DELETE FROM table_name
WHERE ctid NOT IN (
  SELECT max(ctid) FROM table_name
  GROUP BY column1, [column 2, ...]
);

The column ctid is a special column available for every table but not visible unless specifically mentioned. The ctid column value is considered unique for every row in a table.

@msciwoj 2016-10-26 13:54:03

the only universal answer! Works without self/cartesian JOIN. Worth adding though that it's essential to correctly specify GROUP BY clause - this should be the 'uniqueness criteria' that is violated now or if you'd like the key to detect duplicates. If specified wrong it won't work correctly

@Dominik George 2017-07-26 12:29:09

Thanks for teaching me about ctid!

@Tim 2010-12-14 18:34:14

Some of these approaches seem a little complicated, and I generally do this as:

Given table table, want to unique it on (field1, field2) keeping the row with the max field3:

DELETE FROM table USING table alias 
  WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
    table.max_field < alias.max_field

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_account.id < ua2.id;
  • Note - USING is not standard SQL, it is a PostgreSQL extension (but a very useful one), but the original question specifically mentions PostgreSQL.

@Eric Bowman - abstracto - 2011-11-03 02:07:20

That second approach is very fast on postgres! Thanks.

@Fopa Léon Constantin 2014-04-03 22:00:50

@Tim can you better explain what does USING do in postgresql ?

@Shane 2014-08-04 20:59:40

This is by far the best answer. Even if you don't have a serial column in your table to use for the id comparison, it's worth it to temporarily add one to use this simple approach.

@Parker Selbert 2015-03-17 19:34:45

The USING approach is vastly faster than max comparisons. Great answer.

@André C. Andersen 2015-05-01 17:16:30

@FopaLéonConstantin Will flipping the less-than (<) operator to greater-than (>) operator leave me with the minimum user_account.id?

@André C. Andersen 2015-05-01 17:28:57

I just checked. The answer is yes, it will. Using less-than (<) leaves you with only the max id, while greater-than (>) leaves you with only the min id, deleting the rest.

@Arlen Beiler 2015-06-15 13:19:39

The second approach is much faster if email is indexed. Like 100X faster.

@songololo 2015-10-17 09:01:24

Thank you for the simplicity!!!

@11101101b 2015-12-02 04:26:30

This approach also works for MySQL, you just have to restate the 2nd 'user_accounts' like this: DELETE FROM user_accounts USING user_accounts, user_accounts ua2 WHERE user_accounts.email = ua2.email AND user_account.id < ua2.id;

@alexkovelsky 2016-05-06 13:18:26

You could also compare records, which is shorter to write: WHERE (table.field1, table.field2) = (alias.field1, alias.field2)

@alexkovelsky 2016-05-06 13:21:06

@Shane one can use: WHERE table1.ctid<table2.ctid - no need to add serial column

@peufeu 2009-11-22 01:31:12

First, you need to decide on which of your "duplicates" you will keep. If all columns are equal, OK, you can delete any of them... But perhaps you want to keep only the most recent, or some other criterion?

The fastest way depends on your answer to the question above, and also on the % of duplicates on the table. If you throw away 50% of your rows, you're better off doing CREATE TABLE ... AS SELECT DISTINCT ... FROM ... ;, and if you delete 1% of the rows, using DELETE is better.

Also for maintenance operations like this, it's generally good to set work_mem to a good chunk of your RAM: run EXPLAIN, check the number N of sorts/hashes, and set work_mem to your RAM / 2 / N. Use lots of RAM; it's good for speed. As long as you only have one concurrent connection...

@Skippy le Grand Gourou 2014-09-15 16:48:00

If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:

DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);

This will delete only the first of the selected rows.

@Skippy le Grand Gourou 2014-09-15 16:51:57

I know it doesn't address OP's issue, who has many duplicated in millions of rows, but it may be helpful anyway.

@bradw2k 2015-04-10 17:33:53

This would have to be run once for each duplicate row. shekwi's answer need only be run once.

@Mark Cupitt 2013-10-28 05:35:50

This works very nicely and is very quick:

CREATE INDEX otherTable_idx ON otherTable( colName );
CREATE TABLE newTable AS select DISTINCT ON (colName) col1,colName,col2 FROM otherTable;

@shekwi 2013-04-04 10:33:35

The PostgreSQL window function is handy for this problem.

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

See Deleting duplicates.

@bradw2k 2015-04-10 17:25:55

And using "ctid" instead of "id", this actually works for fully duplicate rows.

@Jan 2016-01-12 10:15:50

Great solution. I had to do this for a table with a billion records. I added a WHERE to the inner SELECT to do it in chunks.

@Bhavik Ambani 2012-02-13 12:57:12

From an old postgresql.org mailing list:

create table test ( a text, b text );

Unique values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

Duplicate values

insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

One more double duplicate

insert into test values ( 'x', 'y');

select oid, a, b from test;

Select duplicate rows

select o.oid, o.a, o.b from test o
    where exists ( select 'x'
                   from test i
                   where     i.a = o.a
                         and i.b = o.b
                         and i.oid < o.oid
                 );

Delete duplicate rows

Note: PostgreSQL dosn't support aliases on the table mentioned in the from clause of a delete.

delete from test
    where exists ( select 'x'
                   from test i
                   where     i.a = test.a
                         and i.b = test.b
                         and i.oid < test.oid
             );

@Kalanidhi 2014-03-31 10:34:45

Your explanation is very smart ,but you are missing one point ,In create table specify the oid then only access the oid else error message display

@Bhavik Ambani 2014-03-31 13:03:55

@Kalanidhi Thanks for your comments regarding improvement of the answer, I will take consideration this point.

@Martin F 2014-04-15 22:13:24

This really came from postgresql.org/message-id/…

@sul4bh 2015-12-15 18:49:54

You can use the system column 'ctid' if 'oid' gives you an error.

@Erwin Brandstetter 2012-01-11 21:26:36

Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction. Optionally, you can drop the temporary table at the end of the transaction automatically with ON COMMIT DROP. See below.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.

You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers accordingly. Round up generously because in-memory representation needs a bit more RAM.

SET temp_buffers = 200MB;    -- example value

BEGIN;

-- CREATE TEMPORARY TABLE t_tmp ON COMMIT DROP AS -- drop temp table at commit
CREATE TEMPORARY TABLE t_tmp AS  -- retain temp table after commit
SELECT DISTINCT * FROM tbl;  -- DISTINCT folds duplicates

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;
-- ORDER BY id; -- optionally "cluster" data while being at it.

COMMIT;

This method can be superior to creating a new table if depending objects exist. Views, indexes, foreign keys or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).

For big tables, it is regularly faster to drop indexes and foreign keys, refill the table and recreate these objects. As far as fk constraints are concerned you have to be certain the new data is valid of course or you'll run into an exception on trying to create the fk.

Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load.

If TRUNCATE is not an option or generally for small to medium tables there is a similar technique with a data-modifying CTE (Postgres 9.1+):

WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del;
-- ORDER BY id; -- optionally "cluster" data while being at it.

Slower for big tables, because TRUNCATE is faster there. But may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.

For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.

@xlash 2012-11-01 22:24:15

I used this approach too. However, it might be personnal, but my temp table was deleted, and not available after the truncate... Be careful to do those steps if temp table was created successfully and is available.

@Erwin Brandstetter 2012-11-01 22:54:42

@xlash: You can check for existence to make sure, and either use a different name for the temp table or reuse the one in existence .. I added a bit to my answer.

@Jordan Arseno 2014-02-27 10:04:45

WARNING: Be careful +1 to @xlash -- I have to re-import my data because the temporary table was non-existent after TRUNCATE. As Erwin said, be sure to make sure it exists before truncating your table. See @codebykat's answer

@Erwin Brandstetter 2014-02-27 11:25:08

@JordanArseno: I switched to a version without ON COMMIT DROP, so that people who miss the part where I wrote "in one transaction" don't lose data. And I added BEGIN / COMMIT to clarify "one transaction".

@Jordan Arseno 2014-02-28 02:19:13

Thx @ErwinBrandstetter

@Fopa Léon Constantin 2014-04-04 00:23:39

@ErwinBrandstetter I think This solution is less efficient when there is not that much duplicate to remove from the original table. And it is worst when there is no duplicate at all. Can you provide some improvement for example to avoid truncating when both t_tmp and the original table have same number of rows (=> there where no duplicate). does DELETE more suitable for those situations ?

@Erwin Brandstetter 2015-02-13 16:44:25

@FopaLéonConstantin: Yes, of course. The suggested procedure only makes sense to delete large portions from a big table.

@sschober 2015-04-21 12:01:22

@ErwinBrandstetter Your last example is missing an N in DISTINCT (one character edits are not allowed, at least for me...).

@Erwin Brandstetter 2015-04-21 15:17:47

@sschober: Thanks, fixed.

@castt 2015-04-27 22:02:42

solution with USING took more than 3 hours on table with 14 million records. This solution with temp_buffers took 13 minutes. Thanks.

@codebykat 2013-11-03 18:27:19

I just used Erwin Brandstetter's answer successfully to remove duplicates in a join table (a table lacking its own primary IDs), but found that there's one important caveat.

Including ON COMMIT DROP means the temporary table will get dropped at the end of the transaction. For me, that meant the temporary table was no longer available by the time I went to insert it!

I just did CREATE TEMPORARY TABLE t_tmp AS SELECT DISTINCT * FROM tbl; and everything worked fine.

The temporary table does get dropped at the end of the session.

@just somebody 2009-11-17 02:36:47

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;

@gjrwebber 2009-11-17 03:02:03

Can you make it distinct for group of columns. Maybe "SELECT DISTINCT (t.a, t.b, t.c), * FROM t"?

@just somebody 2009-11-17 03:24:28

@gjrwebber 2009-11-17 06:42:01

Did the trick. Thanks.

@Randal Schwartz 2010-02-15 23:40:06

easier to type: CREATE TABLE tmp AS SELECT ...;. Then you don't need to even figure out what the layout of tmp is. :)

@Topher Fangio 2011-09-19 19:26:33

+1 - Thanks! This just helped me out a ton :-)

@Erwin Brandstetter 2012-01-15 06:44:03

This answer is actually not very good for several reasons. @Randal named one. In most cases, especially if you have depending objects like indexes, constraints, views etc., the superior approach is to use an actual TEMPORARY TABLE, TRUNCATE the original and re-insert the data.

@just somebody 2012-01-15 17:11:39

@ErwinBrandstetter: the question asked for the fastest approach. mass import of data into a table with indexes and constraints is going to take ages. the PostgreSQL manual actually recommends dropping indexes and foreign keys: postgresql.org/docs/9.1/static/populate.html. i'd say your downvote is completely off the mark.

@Erwin Brandstetter 2012-01-15 19:06:45

You are right about indexes. Dropping & recreating is much faster. But other depending objects will break or prevent dropping the table altogether - which the OP would find out after having made the copy - so much for the "fastest approach". Still, you are right about the downvote. It is unfounded, because it is not a bad answer. It is just not that good. You could have added some pointers about indexes or depending objects or a link to the manual like you did in the comment or any kind of explanation. I guess I got frustrated about how people vote. Removed the downvote.

@Rhys van der Waerden 2015-06-10 07:09:08

Could this approach cause cascading deletes on other tables with foreign key references to columns in t?

@Nuno Aniceto 2015-06-23 17:52:18

Use: create table X as table Y; -- to copy the table data info from Y to X (new) Then truncate table X; -- to remove the copied data. Makes easy to abstract from table columns and details, but not so much efficient.

@Secko 2009-11-17 02:38:59

DELETE FROM table
  WHERE something NOT IN
    (SELECT     MAX(s.something)
      FROM      table As s
      GROUP BY  s.this_thing, s.that_thing);

@gjrwebber 2009-11-17 03:00:36

That is what I am currently doing, but it is taking a very long time to run.

@shreedhar 2013-09-24 22:37:30

Wouldn't this fail if multiple rows in table have the same value in column something?

@Jan Marek 2011-05-12 11:05:16

You can use oid or ctid, which is normally a "non-visible" columns in the table:

DELETE FROM table
 WHERE ctid NOT IN
  (SELECT MAX(s.ctid)
    FROM table s
    GROUP BY s.column_has_be_distinct);

@Erwin Brandstetter 2013-10-07 17:37:58

For deleting in place, NOT EXISTS should be considerably faster: DELETE FROM tbl t WHERE EXISTS (SELECT 1 FROM tbl t1 WHERE t1.dist_col = t.dist_col AND t1.ctid > t.ctid)-- or use any other column or set of columns for sorting to to pick a survivor.

@John 2014-03-16 20:59:32

@ErwinBrandstetter, is the query you provide supposed to use NOT EXISTS?

@Erwin Brandstetter 2014-03-17 07:54:50

@John: It must be EXISTS here. Read it like this: "Delete all rows where any other row exists with the same value in dist_col but a bigger ctid". The only survivor per group of dupes will be the one with the biggest ctid.

@Skippy le Grand Gourou 2014-09-15 16:42:10

Easiest solution if you have only a few duplicated rows. Can be used with LIMIT if you know the number of duplicates.

@CM. 2010-02-15 23:16:39

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. In running some tests, I found that adding the "DISTINCT ON (duplicate_column_name)" and the "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru, I found this in the PostgreSQL 8.4 SELECT...DISTINCT doc.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;

@Ole Tange 2009-12-01 18:25:55

This function removes duplicates without removing indexes and does it to any table.

Usage: select remove_duplicates('mytable');

---
--- remove_duplicates(tablename) removes duplicate records from a table (convert from set to unique set)
---
CREATE OR REPLACE FUNCTION remove_duplicates(text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT * FROM ' || tablename || ');';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;

Related Questions

Sponsored Content

28 Answered Questions

[SOLVED] How can I prevent SQL injection in PHP?

8 Answered Questions

[SOLVED] How to exit from PostgreSQL command line utility: psql

8 Answered Questions

[SOLVED] Delete Duplicate Records in PostgreSQL

17 Answered Questions

[SOLVED] Finding duplicate rows in SQL Server

27 Answered Questions

[SOLVED] Finding duplicate values in a SQL table

  • 2010-04-07 18:17:29
  • Alex
  • 2439132 View
  • 1657 Score
  • 27 Answer
  • Tags:   sql duplicates

8 Answered Questions

[SOLVED] How does database indexing work?

33 Answered Questions

[SOLVED] How do I UPDATE from a SELECT in SQL Server?

8 Answered Questions

[SOLVED] Find rows that have the same value on a column in MySQL

  • 2009-11-23 22:33:18
  • bobo
  • 345870 View
  • 173 Score
  • 8 Answer
  • Tags:   sql mysql database

15 Answered Questions

[SOLVED] How to Delete using INNER JOIN with SQL Server?

3 Answered Questions

[SOLVED] How to find duplicate records in PostgreSQL

Sponsored Content