By sheats


2008-09-10 15:33:10 8 Comments

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.

So I'm thinking:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
             FROM sales
             HAVING count = 1)

But my brain hurts going any farther than that.

5 comments

@Erwin Brandstetter 2012-09-28 00:50:06

If you put together the answers so far, clean up and improve, you would arrive at this superior query:

UPDATE sales
SET    status = 'ACTIVE'
WHERE  (saleprice, saledate) IN (
    SELECT saleprice, saledate
    FROM   sales
    GROUP  BY saleprice, saledate
    HAVING count(*) = 1 
    );

Which is much faster than either of them. Nukes the performance of the currently accepted answer by factor 10 - 15 (in my tests on PostgreSQL 8.4 and 9.1).

But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:

UPDATE sales s
SET    status = 'ACTIVE'
WHERE  NOT EXISTS (
   SELECT FROM sales s1                     -- SELECT list can be empty for EXISTS
   WHERE  s.saleprice = s1.saleprice
   AND    s.saledate  = s1.saledate
   AND    s.id <> s1.id                     -- except for row itself
   )
AND    s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates. see below

db<>fiddle here
Old SQL Fiddle

Unique key to identify row

If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):

   AND    s1.ctid <> s.ctid

Every table should have a primary key. Add one if you didn't have one, yet. I suggest a serial or an IDENTITY column in Postgres 10+.

Related:

How is this faster?

The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.

Exclude empty updates

For rows that already have status = 'ACTIVE' this update would not change anything, but still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition like demonstrated above to avoid this and make it even faster:

If status is defined NOT NULL, you can simplify to:

AND status <> 'ACTIVE';

Subtle difference in NULL handling

This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):

(123, NULL)
(123, NULL)

Also passes in a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard. See:

OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:

If all columns being compared are defined NOT NULL, there is no room for disagreement.

@Joel Coehoorn 2012-09-30 21:03:14

Good answer. I'm a sql server guy, so the first suggestion of using a tuple with an IN ( ) check wouldn't occur to me. The not exists suggestion is usually gonna end up with the same execution plan in sql server as the inner join.

@Peter 2013-08-15 14:59:46

Nice. The explanation greatly increases the value of the answer. I'm almost tempted to run some tests with Oracle to see how the plans compare with Postgres and SQLServer.

@alairock 2016-09-29 19:47:08

count(*) is less efficient than something like: count(saleprice) (unless accuracy critical, then yes, use count(*) )

@Erwin Brandstetter 2016-09-29 23:28:49

@alairock: Where did you get that? For Postgres, the opposite is true. While counting all rows, count(*) is more efficient than count(<expression>). Just try it. Postgres has a faster implementation for this variant of the aggregate function. Maybe you are confusing Postgres with some other RDBMS?

@alairock 2016-10-04 20:33:31

@Erwin Brandstetter 2016-10-04 20:36:54

@alairock: I happen to be co-author of that page and it doesn't say anything of the sort.

@trthhrtz 2018-10-22 15:31:41

Why is here needed HAVING COUNT(*) = 1 at all?

@Erwin Brandstetter 2018-10-23 00:23:26

@trthhrtz: Cause the OP is looking for unique combinations.

@WebWanderer 2019-05-07 18:40:34

@ErwinBrandstetter , you are always so on point with your answers across the stack. You've helped throughout the years in a nearly unimaginable amount of ways. As for this example, I knew a few different ways to solve my problem, but I wanted to see that someone had tested the efficiency between the possibilities. Thank you.

@Erwin Brandstetter 2019-05-07 22:33:31

@WebWanderer: That's good to hear. :)

@Abdulhafeth Sartawi 2019-01-31 08:52:09

If your DBMS doesn't support distinct with multiple columns like this:

select distinct(col1, col2) from table

Multi select in general can be executed safely as follows:

select distinct * from (select col1, col2 from table ) as x

As this can work on most of the DBMS and this is expected to be faster than group by solution as you are avoiding the grouping functionality.

@frans eilering 2018-01-13 07:56:09

I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using

Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering

It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.

use the GROUP to select the records of 'GrondOfLucht' in the order given by 'sortering

SELECT        GrondOfLucht
FROM            dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht, sortering
ORDER BY MIN(sortering)

@Kerwin Sneijders 2019-01-14 20:13:20

This basically explains what the accepted answer does, but I'd recommend not using such names for an example (at least translate them). PS: I recommend always naming everything in English in all projects even if you're dutch.

@Christian Berg 2008-09-10 16:17:13

The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.

This should work however (and doesn't need a join):

UPDATE sales
SET status='ACTIVE'
WHERE id IN (
  SELECT MIN(id) FROM sales
  GROUP BY saleprice, saledate
  HAVING COUNT(id) = 1
)

You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.

@Joel Coehoorn 2008-09-10 15:36:09

SELECT DISTINCT a,b,c FROM t

is roughly equivalent to:

SELECT a,b,c FROM t GROUP BY a,b,c

It's a good idea to get used to the GROUP BY syntax, as it's more powerful.

For your query, I'd do it like this:

UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
    SELECT id
    FROM sales S
    INNER JOIN
    (
        SELECT saleprice, saledate
        FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(*) = 1 
    ) T
    ON S.saleprice=T.saleprice AND s.saledate=T.saledate
 )

@Erwin Brandstetter 2012-09-30 20:45:11

This query, while correct and being accepted for year now, is extremely inefficient and unnecessarily so. Don't use this. I provided an alternative and some explanation in another answer.

@famargar 2017-01-22 19:47:34

isn't SELECT DISTINCT a,b,c FROM t precisely the same thing as SELECT a,b,c FROM t GROUP BY a,b,c?

@Joel Coehoorn 2017-01-22 19:50:32

@famargar for the simple case, yet, but they have different meanings semantically, and they are different in terms of what you can do for the step when building a larger query. Plus, people on tech forums can often be extremely pedantic about things, I find it often useful to add weasel words to my posts in this context.

Related Questions

Sponsored Content

28 Answered Questions

[SOLVED] How can I prevent SQL injection in PHP?

37 Answered Questions

20 Answered Questions

[SOLVED] LINQ's Distinct() on a particular property

10 Answered Questions

[SOLVED] SQL to find the number of distinct values in a column

  • 2008-09-26 19:52:48
  • Christian Oudard
  • 496909 View
  • 315 Score
  • 10 Answer
  • Tags:   sql distinct

14 Answered Questions

[SOLVED] Select first row in each GROUP BY group?

16 Answered Questions

[SOLVED] How can I do an UPDATE statement with JOIN in SQL?

33 Answered Questions

[SOLVED] How do I UPDATE from a SELECT in SQL Server?

2 Answered Questions

[SOLVED] Using group by on multiple columns

  • 2010-03-10 23:11:23
  • l--''''''---------''''''''''''
  • 891563 View
  • 921 Score
  • 2 Answer
  • Tags:   sql group-by

37 Answered Questions

[SOLVED] How can I remove duplicate rows?

17 Answered Questions

Sponsored Content