By Mike Henderson


2018-04-20 14:24:42 8 Comments

I have an aggregate query that has a lot of columns and generate a huge data set:

SELECT      column1,
            column2,
            ...
            column20,
            sum(column21),
            ...
            sum(column40)
INTO        #Output
FROM        #Ledger
GROUP BY    column1,
            ...
            column20

The input table (#Ledger) has 18m rows and the result table (#Output) has 600k rows. The query took 9 minutes. Is there anything I can do to make it faster? Here's the execution plan:

execution plan

There's no index on the #Ledger table and the exclamation point in Sort gives the following warning:

Operator used tempdb to spill data during execution with spill level 1 and 4 spilled thread(s), Sort wrote 1541592 pages to and read 1541592 pages from tempdb with granted memory 3752160KB and used memory 3681824KB

2 comments

@David Browne - Microsoft 2018-04-20 15:33:07

Remove the unnecessary GROUP BY columns and create clustered index on the remaining ones.

What subset of those GROUP BY columns are necessary to uniquely identify a row in #OUTPUT? Keep those, and replace the others. Something like:

SELECT      column1,
            column2,
            column3,
            column4,
            max(column5) column5,
            ...
            max(column20) column20,
            sum(column21),
            ...
            sum(column40)
INTO        #Output
FROM        #Ledger
GROUP BY    column1,
            column2,
            column3,
            column4

@Mike Henderson 2018-04-20 15:46:44

An interesting idea. I will try this later today

@ypercubeᵀᴹ 2018-04-20 15:52:56

And the - chosen 4 (or n) columns - don't necessarily have to be UNIQUE. If all combination have small number of rows - and you use the original GROUP BY with the 20 columns - the clustered index will still be used effectively, with only the low overhead of sorting/separating each small group into subgroups.

@Erik Darling 2018-04-20 15:54:06

Looking at your query plan, it seems that the optimizer has chosen a Stream Aggregate operator to handle your GROUP BY. The downside of these is that they require ordered data, which a HEAP is not.

David's suggestion is perfectly valid. Adding a clustered index may remove the Sort operator, so performance testing that vs. my suggestion is up to you.

If indexing the temp table doesn't give you the performance you need, applying an OPTION(HASH GROUP); hint to the query will influence the optimizer to use a Hash Match Aggregate rather than a Stream Aggregate.

Hash Match Aggregates don't require sorted data, but may also spill to disk the way the Sort spilled in the Stream Aggregate plan.

Related Questions

Sponsored Content

1 Answered Questions

1 Answered Questions

[SOLVED] SQL Server : speed up this query

2 Answered Questions

[SOLVED] Query memory grant and tempdb spill

1 Answered Questions

1 Answered Questions

1 Answered Questions

1 Answered Questions

[SOLVED] Why are queries causing spill to tempdb?

1 Answered Questions

1 Answered Questions

[SOLVED] Workspace Memory Internals

Sponsored Content