By John M Gant


2009-05-11 16:19:11 8 Comments

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.

This article suggest using the NEWID() function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.

Anybody ever do this before? Any ideas?

15 comments

@SpacePhoenix 2018-06-13 06:05:42

The server-side processing language in use (eg PHP, .net, etc) isn't specified, but if it's PHP, grab the required number (or all the records) and instead of randomising in the query use PHP's shuffle function. I don't know if .net has an equivalent function but if it does then use that if you're using .net

ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.

@John M Gant 2018-06-14 19:21:07

I don't recall exactly what I was using this for at the time, but I was probably working in C#, maybe on a server, or maybe in a client application, not sure. C# doesn't have anything directly comparable to PHP's shuffle afaik, but it could be done by applying functions from the Random object within a Select operation, ordering the result, and then taking the top ten percent. But we'd have to read the whole table from disk on the DB server and transmit it over the network, only to discard 90% of that data. Processing it directly in the DB is almost certainly more efficient.

@VISHMAY 2017-11-10 12:40:18

I was using it in subquery and it returned me same rows in subquery

 SELECT  ID ,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable 
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE() ,
            1
    FROM    Mytable

then i solved with including parent table variable in where

SELECT  ID ,
            ( SELECT TOP 1
                        ImageURL
              FROM      SubTable 
              Where Mytable.ID>0
              ORDER BY  NEWID()
            ) AS ImageURL,
            GETUTCDATE() ,
            1
    FROM    Mytable

Note the where condtition

@RJardines 2014-10-15 21:51:56

This link have a interesting comparison between Orderby(NEWID()) and other methods for tables with 1, 7, and 13 millions of rows.

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.

SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

  1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
  2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.

Please read the full article in MSDN.

@bummi 2014-10-15 22:09:27

Hi Deumber, nice found, you might flesh it out since link only answers are likely to be deleted.

@QMaster 2016-09-22 11:36:50

@bummi I changed it to avoid being link only answer :)

@Nanki 2015-10-16 17:03:52

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():

SELECT TOP [number] 
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())

@Hai Phan 2015-05-07 22:17:37

It appears newid() can't be used in where clause, so this solution requires an inner query:

SELECT *
FROM (
    SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
    FROM MyTable
) vw
WHERE Rnd % 100 < 10        --10%

@klyd 2014-07-29 17:45:15

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.

For MS SQL:

Minimum example:

select top 10 percent *
from table_name
order by rand(checksum(*))

Normalized execution time: 1.00

NewId() example:

select top 10 percent *
from table_name
order by newid()

Normalized execution time: 1.02

NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.

Selection with Initial Seed:

declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */

If you need to select the same set given a seed, this seems to work.

@QMaster 2016-09-22 11:42:02

Is there any advantage of using special @seed against RAND() ?

@klyd 2016-09-23 17:05:09

Not 100% sure what you're asking, mind clarifying?

@QMaster 2016-09-24 22:35:15

absolutely, You used seed parameter and fill it by date parameter, RAND() function do the same except using the complete time value, I want to know is any advantage to using handy created parameter like seed above RAND() or not?

@klyd 2016-09-26 18:31:45

Ah!. OK, this was a requirement of the project. I needed to generate a list of n-random rows in a deterministic way. Basically leadership wanted to know what "random" rows we'd be selecting a few days before the rows were selected and processed. By building a seed value based on the year/month I could guarantee any call to the query that year would return the same "random" list. I know, it was strange and there were probably better ways but it worked...

@QMaster 2016-09-26 20:21:28

HAHA :) I see, but I think general meaning of random selected records is not the same records on different running query.

@Jeff Ferland 2009-05-11 16:30:59

In MySQL you can do this:

SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;

@Tom H 2009-05-11 16:42:56

This will not work. Since the select statement is atomic, it only grabs one random number and duplicates it for each row. You would have to reseed it on each row to force it to change.

@Jeff Ferland 2009-05-11 16:44:17

Mmm... love vendor differences. Select is atomic on MySQL, but I suppose in a different way. This will work in MySQL.

@Oskar Austegard 2012-09-26 17:23:46

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:

DECLARE @sampleCount int = 50
SET STATISTICS TIME ON

SELECT TOP (@sampleCount) * 
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()

SET STATISTICS TIME OFF

In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. (Note that TABLESAMPLE will not accept a variable)

@Kyle McClellan 2012-09-05 22:23:02

Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.

  SELECT * FROM Table1
  WHERE (ABS(CAST(
  (BINARY_CHECKSUM(*) *
  RAND()) as int)) % 100) < 10

@John M Gant 2012-09-06 17:37:58

Very interesting. After reading the article, I don't really understand why RAND() doesn't return the same value for each row (which would defeat the BINARY_CHECKSUM() logic). Is it because it's being called inside another function rather than being part of the SELECT clause?

@Mark Melville 2014-01-09 17:42:57

This query ran on a table with 6MM rows in less than a second.

@Oliver 2014-02-21 10:27:59

I've ran this query on a table with 35 entries and kept having two of them in the result set very often. This might be a problem with rand() or a combination of the above - but I turned away from this solution for that reason. Also the number of results varied from 1 to 5 so this might also not be acceptable in some scenarios.

@Hai Phan 2015-05-07 22:08:35

Doesn't RAND() return the same value for every row?

@Brian 2019-04-03 15:38:31

RAND() returns the same value for every row (which is why this solution is fast). However, rows with binary checksums that are very close together are at high risk of generating similar checksum results, causing clumping when RAND() is small. E.g., (ABS(CAST((BINARY_CHECKSUM(111,null,null) * 0.1) as int))) % 100 == SELECT (ABS(CAST((BINARY_CHECKSUM(113,null,null) * 0.1) as int))) % 100. If your data suffers from this problem, multiply BINARY_CHECKSUM by 9923.

@Brian 2019-04-03 15:41:12

I picked 9923 somewhat arbitrarily. However, I wanted it to be prime (though coprime with 100 would probably be sufficient). Also because as long as RAND() isn't very tiny, 9923 is large enough to spread out the clumps.

@Patrick Taylor 2009-05-11 20:15:58

Depending on your needs, TABLESAMPLE will get you nearly as random and better performance. this is available on MS SQL server 2005 and later.

TABLESAMPLE will return data from random pages instead of random rows and therefore deos not even retrieve data that it will not return.

On a very large table I tested

select top 1 percent * from [tablename] order by newid()

took more than 20 minutes.

select * from [tablename] tablesample(1 percent)

took 2 minutes.

Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid().

Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.

See the MSDN page.

@Oskar Austegard 2012-09-26 16:16:06

As pointed out by Rob Boek below, tablesampling clumps results, and is therefore not a good way to get a small number of random results

@FrenkyB 2015-03-08 08:26:23

You mind the question how this works: select top 1 percent * from [tablename] order by newid() since newid() is not a column in the [tablename]. Is sql server appending internally column newid() on each row and then make a sort ?

@jessier3 2016-11-23 12:50:46

The tablesample was the best answer for me as I was doing a complex query on a very large table. No question that it was remarkably fast. I did get a variation in the number records returned as I ran this multiple times but all of them were within an acceptable margin of error.

@Deep 2010-12-10 12:20:17

This works for me:

SELECT * FROM table_name
ORDER BY RANDOM()
LIMIT [number]

@John M Gant 2010-12-10 17:27:15

@user537824, did you try that on SQL Server? RANDOM isn't a function and LIMIT isn't a keyword. The SQL Server syntax for what you're doing would be select top 10 percent from table_name order by rand(), but that also doesn't work because rand() returns the same value on all rows.

@Ravi Parashar 2011-09-27 07:59:48

Try this:

SELECT TOP 10 Field1, ..., FieldN
FROM Table1
ORDER BY NEWID()

@Ralph Shillington 2009-05-11 16:26:29

select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

@JoshBerke 2009-05-11 16:30:01

I like this approach much better then using the article he referenced.

@user12861 2011-08-23 20:18:24

It's always good to keep in mind that newid() isn't a really good pseudorandom number generator, at least not nearly as good as rand(). But if you just need some vaguely randomish samples and don't care about mathematical qualities and such, it'll be good enough. Otherwise you need: stackoverflow.com/questions/249301/…

@Snailer 2015-01-24 15:34:33

Um, sorry if this is obvious.. but what does [yourPk] refer to? EDIT: Nvm, figured it out... Primary Key. Durrr

@Brans Ds 2015-06-03 10:23:26

newid - guid is disigned to be unique but not random.. incorrect approach

@Arbaz Abid 2017-05-08 13:49:54

with large number of rows for example over 1 million newid() Sort Estimate I/O cost will be very high and will effect performance.

@Nadir Sidi 2018-03-08 16:36:01

The use of rand() versus newid() also allows you to set a seed, which is useful if you want your query to be repeatable.

@Rob Boek 2009-05-28 18:15:18

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

When run against a table with 1,000,000 rows, here are my results:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.

@Andrew Mao 2012-09-19 20:49:27

I saw that article too and trying it on my code, it seems that NewID() is evaluated only once, instead of per row, which I don't like...

@Daniel Brückner 2009-05-11 16:31:11

Just order the table by a random number and obtain the first 5,000 rows using TOP.

SELECT TOP 5000 * FROM [Table] ORDER BY newid();

UPDATE

Just tried it and a newid() call is sufficent - no need for all the casts and all the math.

@hkf 2014-10-03 00:48:01

The reason that 'all the casts and all the maths' is used is for better performance.

Related Questions

Sponsored Content

25 Answered Questions

[SOLVED] How do I perform an IF...THEN in an SQL SELECT?

12 Answered Questions

[SOLVED] How to randomly select an item from a list?

  • 2008-11-20 18:42:21
  • Ray Vega
  • 1247595 View
  • 1570 Score
  • 12 Answer
  • Tags:   python list random

27 Answered Questions

24 Answered Questions

[SOLVED] Check if table exists in SQL Server

43 Answered Questions

[SOLVED] How to return only the Date from a SQL Server DateTime datatype

45 Answered Questions

27 Answered Questions

[SOLVED] Finding duplicate values in a SQL table

  • 2010-04-07 18:17:29
  • Alex
  • 2439240 View
  • 1657 Score
  • 27 Answer
  • Tags:   sql duplicates

33 Answered Questions

[SOLVED] How do I UPDATE from a SELECT in SQL Server?

4 Answered Questions

[SOLVED] Inserting multiple rows in a single SQL query?

37 Answered Questions

Sponsored Content