2012-08-13 16:03:09 8 Comments

I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)

I'm using GUIDs because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migrating data between databases.

I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).

What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?


@Robert J. Good 2015-03-31 21:27:41

I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire-and-forget merge tables without all the worry of ints matching across joined tables. GUID joins can be copied without any worry.

This is my setup for using GUIDs:

  1. PK = GUID. GUIDs are indexed similarly to strings, so tables with high row counts (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.

  2. PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.

  3. Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.

  4. Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.

Here is my Email table...

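CREATE TABLE [Core].[Email] (
    [EmailID]      UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (newsequentialid()) NOT NULL,
    [EmailAddress] NVARCHAR (50)    CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
    [CreatedDate]  DATETIME         CONSTRAINT [DF_Email_CreatedDate] DEFAULT (getutcdate()) NOT NULL,
    -- ClusterID is assumed to be an IDENTITY column; it exists only to order the table on disk
    [ClusterID]    INT IDENTITY (1, 1) NOT NULL,
    -- The GUID stays the primary key, but as a NONCLUSTERED index
    CONSTRAINT [PK_Email] PRIMARY KEY NONCLUSTERED ([EmailID] ASC)
);

CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID]);

CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] ASC);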

@Phil 2017-09-02 15:47:35

Could you explain the PK_Email constraint? Why you have ... NonClustered(EmailID ASC) instead of ...Nonclustered(ClusterID ASC) ?

@Robert J. Good 2017-09-03 16:28:57

You bet. Two main things are going on with the indexes: 1. Clustered on ClusterID - orders your table on disk (0% fragmentation). 2. NonClustered on EmailID - indexes the EmailID field to speed up GUID ID lookups. A GUID field lookup behaves string-ish, so an EmailID lookup would be slow without the index.

@Dale K 2019-07-05 02:55:27

@RobertJ.Good I've seen this method discussed before i.e. adding a surrogate int key to cluster on. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data?

@Robert J. Good 2019-08-20 20:10:38

Hi @DaleBurrell, the clustered index is to prevent table fragmentation. Performance gain happens as the table naturally grows in order on disk, with low fragmentation.
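In the absence of linked benchmark data, one way to check the fragmentation claim on your own tables is to query the `sys.dm_db_index_physical_stats` DMV. This is a minimal sketch that assumes the `[Core].[Email]` table from the answer above. The `'LIMITED'` scan mode and the choice of columns are illustrative, not something taken from the answer.

```sql
-- Report fragmentation per index (or heap) for one table in the current database
SELECT  i.name                          AS IndexName,
        i.type_desc                     AS IndexType,        -- HEAP, CLUSTERED or NONCLUSTERED
        ps.avg_fragmentation_in_percent AS FragmentationPct,
        ps.page_count                   AS Pages
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID(N'Core.Email'), NULL, NULL, 'LIMITED') AS ps
JOIN    sys.indexes AS i
        ON  i.object_id = ps.object_id
        AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running it before and after a bulk insert, once with the clustered index in place and once against a heap copy of the table, gives a rough comparison for your own workload.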

@dariol 2019-12-05 09:34:53

@RobertJ.Good Is that a web application? What are you using in urls/hrefs? guid or int?

@Robert J. Good 2019-12-23 23:29:55

@dariol There are security implications, so drop the newsequentialid() and expose a NewId() Guid if no other choice (definitely not the Int.) I'd recommend a claims based and/or token approach, or even brute-force encryption for any identifiers. In short, avoid exposing any Ids, and avoid any value that can be guessed, or worse +1 to find the next record.

@jfrobishow 2020-01-28 22:40:52

@RobertJ.Good when you mention "In this distributed database world, it is absolutely the best way to merge distributed data." do you mean you eventually merge the records into a master database? I'm wondering what happens to the ClusterID then. How do you handle duplicates once you merge the "source"?

@DaBlue 2019-04-15 17:10:59

Having sequential IDs makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.

@jonaglon 2020-01-28 10:03:38

Can you provide any logic or evidence to back up this claim? I'm struggling to see how a sequential id might compromise security.

@DaBlue 2020-01-28 15:29:32

Sure. If you know the IDs are integers, you can guess records in a DB sequentially. So if you query a single item, you know that the next item is pk + 1. Random GUIDs do not follow a pattern, so it would be nearly impossible to query records other than the one you already queried (and whose PK you know).

@jonaglon 2020-01-29 09:37:51

If a hacker can query your database you're already compromised, I fail to see how sequential id's make the situation worse.

@DaBlue 2020-01-30 14:32:30

No. That is not true. I do a lot of pen testing and am well known for catching hackers. Do I always use GUIDs and not int? No. But if I need to protect data, I will rely on data techniques as well as programming to protect it, and there are a LOT of reasons to do this. Take this older example. A URL like that makes me cringe. The 1012 is the key of the record and can be switched out. With reactive sites this changes a bit but can still be seen. And it's complex to protect records at that point. Know your data and protect what needs to be protected.

@jonaglon 2020-01-30 15:27:46

If a user can switch out 1012 for another number and see data they shouldn't then there is a very serious security issue, that issue isn't caused by the primary key choice but it is exacerbated by it. I do take your point, thank you for spelling it out.

@DaBlue 2020-01-30 15:50:55

I highlight this as a requirement for any app that deals with hypersensitive data like HIPAA or SOX. Relying on only programming as a security restraint is dangerous. It's best to use multiple methods when protecting sensitive data.

@Panos Roditakis 2020-01-30 21:37:20

You can locate a record from a web page using a GUID that is not the PK of the table. Using a query parameter in a website should not define how you structure your DB schema. The PK has nothing to do with input and parameters in the UI or backend system.

@Asrar Ahmad Ehsan 2019-02-18 12:08:13

Most of the time a GUID should not be used as the primary key for a table, because it really hurts the performance of the database. Useful links regarding the GUID's impact on performance and its use as a primary key.


@EricImhauser 2017-05-12 08:14:15

I am currently developing a web application with EF Core and here is the pattern I use:

All my classes (tables) have an int PK and FK. I have an additional column of type Guid (generated by the C# constructor) with a nonclustered index on it.

All the table joins within EF are managed through the int keys, while all access from outside (controllers) is done with the Guids.

This solution keeps the int keys out of URLs while keeping the model tidy and fast.
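On the SQL Server side, the pattern described above might look roughly like the following sketch. The table, column, and index names (`dbo.Customer`, `PublicId`, and so on) are hypothetical and not taken from the answer, and the literal GUID is just a sample value.

```sql
-- Hypothetical table: int key for joins and clustering, Guid for outside access
CREATE TABLE dbo.Customer (
    CustomerId INT IDENTITY (1, 1) NOT NULL,
    PublicId   UNIQUEIDENTIFIER    NOT NULL,   -- assigned by the application before insert
    Name       NVARCHAR (100)      NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
);

-- Nonclustered index so lookups by the externally visible Guid stay fast
CREATE UNIQUE NONCLUSTERED INDEX IX_Customer_PublicId
    ON dbo.Customer (PublicId);

-- A controller would resolve the Guid from the URL once, then join on the int key
DECLARE @PublicId UNIQUEIDENTIFIER = '6F9619FF-8B86-D011-B42D-00C04FC964FF';

SELECT c.CustomerId, c.Name
FROM   dbo.Customer AS c
WHERE  c.PublicId = @PublicId;
```

The `CLUSTERED` keyword on the primary key is spelled out only for clarity; in SQL Server a primary key becomes the clustered index by default unless the table already has one or you say otherwise.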

@Allen Wang 2018-08-02 20:07:31

Is there anything you need to do to configure the integer pK as clustered, like data annotations, or is it just automatically configured?

@Trong Phan 2019-05-09 17:50:27

What name do you use for the Guid property?

@Matt 2012-08-13 16:22:51

This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).

P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.

@VAAA 2012-08-13 16:24:56

Totally agree!! But that means that whether I have a GUID as PK or a Composite PK with the GUID and another field, it is going to be the same, right?

@Matt 2012-08-13 16:29:50

The PK (index) would be made up of the two columns, but unless you have some business specific reason for doing this, it seems unnecessary.

@Matt 2012-08-13 16:43:03

BTW this question is one of the most polarizing and debated questions out there and therefore extremely difficult to get an answer for that you will feel 100% comfortable with. Either method comes with trade-offs, so good luck :)

@marc_s 2012-08-13 16:34:59

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
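The figures in the quick calculation follow from the key size alone: 4 bytes for an INT and 16 bytes for a UNIQUEIDENTIFIER. The worked arithmetic below ignores row and page overhead and assumes 1 MB = 1024² bytes, so treat it as a lower bound rather than a measurement.

```latex
\begin{align*}
\text{INT, base table:}    &\quad 1{,}000{,}000 \times 4\,\text{B}  = 4{,}000{,}000\,\text{B}  \approx 3.81\,\text{MB} \\
\text{GUID, base table:}   &\quad 1{,}000{,}000 \times 16\,\text{B} = 16{,}000{,}000\,\text{B} \approx 15.26\,\text{MB} \\
\text{INT, 6 NC indexes:}  &\quad 6 \times 4{,}000{,}000\,\text{B}  = 24{,}000{,}000\,\text{B} \approx 22.89\,\text{MB} \\
\text{GUID, 6 NC indexes:} &\quad 6 \times 16{,}000{,}000\,\text{B} = 96{,}000{,}000\,\text{B} \approx 91.55\,\text{MB} \\
\text{INT total:}          &\quad 28{,}000{,}000\,\text{B}  \approx 26.70\,\text{MB} \\
\text{GUID total:}         &\quad 112{,}000{,}000\,\text{B} \approx 106.81\,\text{MB}
\end{align*}
```

The nonclustered index rows grow with the clustering key because each index entry carries a copy of it, which is why the 4x size ratio carries through to every index.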

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

 .... add more columns as needed ...... )



Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED
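The snippet above is truncated, so here is a minimal sketch of what the paragraph describes. Only the column names `PKGUID` and `MYINT` come from the text; the table, constraint, and index names are hypothetical.

```sql
-- Hypothetical table; only PKGUID and MYINT are named in the answer
CREATE TABLE dbo.MyTable (
    PKGUID UNIQUEIDENTIFIER    NOT NULL,
    MYINT  INT IDENTITY (1, 1) NOT NULL
    -- add more columns as needed
);

-- Primary key on the GUID, explicitly NONCLUSTERED
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY NONCLUSTERED (PKGUID);

-- Separate clustered index on the small, ever-increasing INT
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable (MYINT);
```

Because the primary key constraint names `NONCLUSTERED` explicitly, the order in which the two indexes are created does not matter.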

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

@Andrew Theken 2014-02-26 15:15:33

This is a great answer, one thing I'd mention is that being able to generate the key before insert is frequently useful. Using "newsequentialid()" can help with the clustering, but that requires an additional round-trip to SQL. So another benefit of the "surrogate key" approach is that you can generate new ids, client-side, with fewer index fragmentation concerns.

@Fred Lackey 2014-07-15 12:52:46

Just curious. Would storing the GUID as a char(32) or char(36) PK solve this? Why / why not?

@marc_s 2014-07-15 13:41:23

@FredLackey: no - same problem - since the key is totally random, excessive index fragmentation will happen. Just don't do it.

@pinkfloydx33 2014-11-01 12:50:57

The way I read this is that having both a non clustered uniqueidentifier column and the int identity column, FK's should also be uniqueidentifier? If you do that, when would you actually use the identity column directly, or would you not?

@A_L 2015-02-06 11:54:09

@marc_s If my GUID pk is non-clustered and I use this to join my tables (for portability across databases) then the clustered int index is redundant right? Do you foresee any problem with having no unique clustered index and doing as I describe?

@marc_s 2015-02-06 12:52:59

@A_L: yes, a table without a clustering index is a heap - and that's really really bad for many reasons. Don't just toss your clustered index - it's important on so many levels!

@Nicolas Belley 2015-06-27 13:33:47

Little question, should the GUID now be used on joins, or the int id? My instinct tells me the GUID should be used, but I fail to see a technical problem using the int id...

@marc_s 2015-06-27 21:47:57

@NicolasBelley: the int is probably a bit more efficient, since it's 4x smaller in sheer size ...

@Nicolas Belley 2015-06-28 11:51:16

@marc_s but in a replication scenario, if the int column is identity, shouldn't we use the GUID since the int column can repeat itself across devices?

@Derek Greer 2015-07-22 21:44:59

Great information, but as with most things, the right choice depends upon the needs of your application. I do however feel like the article would be more balanced with discussion of the GuidComb strategy. The "cheap space isn't the point" article also has some great information, but I'm not sure the test scenarios are optimal for application developers. Using Identity over generating the key app side invariably leads to round-trips. I think comparisons around typical DDD object graph persistence scenarios would help give a more accurate picture to help with the decision process.

@Nick.McDermaid 2016-07-09 08:42:26

This is an old thread, but might I add: don't just use a useless arbitrary INT as the clustering key. Use something useful, like an incrementing date that is actually searched on and has some relation to the data you're storing. You only get one clustering key, and if you choose the right one you'll get good performance.

@Kip ei 2017-09-25 13:25:10

@marc_s although this is outdated, I totally agree with Nick.McDermaid, but maybe I am missing something?! I am very curious about your opinion on what Nick.McDermaid has to say!!

@marc_s 2017-09-25 13:27:41

@Kipei: the main issue is the IF: if you have such a natural value, then yes, you can use it as a primary key. BUT: values like DATETIME, for instance, are NOT useful for a clustering key, since they have only 3.33 ms accuracy, and thus duplicates can exist. So in such a case, you still need an INT IDENTITY instead. Therefore, I typically use that by default, since from my 20+ years of experience, a really usable natural key hardly ever exists ....
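The DATETIME accuracy point can be checked with a short sketch. The two literals are arbitrary sample values one millisecond apart; DATETIME stores time in increments of .000, .003 and .007 seconds, so the second value should be rounded onto the first.

```sql
-- Two timestamps one millisecond apart, stored as DATETIME
DECLARE @a DATETIME = '2020-01-01T12:00:00.000';
DECLARE @b DATETIME = '2020-01-01T12:00:00.001';  -- rounded to .000 on assignment

-- If the values collapse, a unique index on such a column would reject the second row
SELECT @a AS FirstValue,
       @b AS SecondValue,
       CASE WHEN @a = @b THEN 'duplicate' ELSE 'distinct' END AS Comparison;
```

Two events less than about 3 ms apart can therefore end up with the same key value, which is the duplicate problem described above.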

@Anyname Donotcare 2018-06-26 09:52:24

I'm currently working on an EF6 Code First web application (SQL Server 2012 or 2017) and want to apply DDD concepts, so I need a unique key in advance, before inserting into the DB. Many recommend a UUID, but I'm afraid of the performance issues. Could you please help me make the right decision? Should I use a GUID instead of an auto-increment key?

@marc_s 2018-06-26 11:05:16

@AnynameDonotcare: I'd still strongly recommend using an auto-increment INT or BIGINT. What makes you think you must know the ID value before saving?? I fail to see how DDD or any other design approach mandates this......

@Anyname Donotcare 2018-06-26 11:10:40

I've learned that I should keep my model in a valid state, and that when creating a new instance of a class it's recommended to pass all required attributes as constructor parameters. So I converted all my IDs to GUID instead of int. Should I use both of them? One as auto-increment to solve the technical issue and the GUID to solve the DDD issue?

@Dale K 2019-07-05 02:58:23

@marc_s I've seen this method of adding a surrogate int key to cluster on discussed before. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data? I know everyone says a heap is bad and performs worse, but also it seems there are many opinions out there.

@marc_s 2019-07-05 03:36:21

@DaleBurrell: check out… - and any blog post by Kimberly Tripp for that matter - she's the "Queen of indexing" and I'm pretty sure there's performance testing numbers in her blog somewhere to show this very clearly

@Dale K 2019-07-05 03:42:38

Thanks @marc_s - I have read most of her stuff and didn't see data to back that up. But will look again. on your link she says "Oh – and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help" - but doesn't expand on that.

@AnandPhadke 2012-08-13 16:47:42

If you use a GUID as the primary key and create a clustered index on it, then I suggest using NEWSEQUENTIALID() as its default value.
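A minimal sketch of that suggestion follows, with a hypothetical `dbo.Document` table. Note that `NEWSEQUENTIALID()` can only be used in a DEFAULT constraint, so the value is generated by the server rather than by the application, which is not what the question asked for.

```sql
-- Hypothetical table; names are illustrative only
CREATE TABLE dbo.Document (
    DocumentId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Document_DocumentId DEFAULT (NEWSEQUENTIALID()),
    Title      NVARCHAR (200)   NOT NULL,
    CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentId)
);

-- Omitting DocumentId lets the default supply the value;
-- OUTPUT returns the generated key to the caller in the same statement
INSERT INTO dbo.Document (Title)
OUTPUT inserted.DocumentId
VALUES (N'First document');
```

The idea is that mostly increasing values are appended near the end of the clustered index instead of landing on random pages, which reduces the page splits discussed in the other answers.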

@genuinefafa 2020-05-17 19:26:56

why would you do that?
