By user261963


2015-05-06 16:23:59 8 Comments

I have two sets of earth measurements from satellite data, each with time fields (mjd for mean julian date) and geography positions (GeoPoint, spacial) and I'm looking for coincidences between the two sets such that their times match to a threshold of 3hrs (or .125 days) and their distances to within 200 km of each other.

I have made indexes for both the mjd fields on both tables and spatial tables.

When I just join on the time constraint, the database calculates 100,000 matches in 8 seconds and computes the distances for all 100,000 matches in that time. The query looks like this:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

And the executed plan is:

Only mjd constraint

When sorted, 9 of the distances were under 200km, so there are matches. The trouble is, when I add the distance constraint and run this instead,

select top 10 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
and h.GeoPoint.STDistance(m.GeoPoint)<200000
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

it goes away for a long time. Obviously, in 8 seconds, it could find 100,000 time matches, 9 of which were under 200km, so the optimizer must be trying something sub-optimal. The plan looks similar to above with a filter on the distances (I'm guessing).

with spatial constrant, no spatial filter

I can force the use of the spatial index with this:

select top 5 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0 
from L2V5.dbo.header h join L2.dbo.MLS_Header m 
on h.GeoPoint.STDistance(m.GeoPoint)<200000
and h.mjd between m.mjd-.125 and m.mjd+.125 
option( table hint ( h, index(ix_MJD), index(ix_GeoPoint) ), table hint( m, index(ix_MJD) ) )

both constraints with both indexes

which then takes 3 minutes to find 5 matches.

How do I tell the query optimizer to use the MJD index seek first, and then the spatial index second (or is that what it is doing already) and is there any way I can help it out by telling it how many matches to expect? If it can compute 100,000 matches with distances in 8 seconds that has 9 under 200km, shouldn't the addition of the spatial index make it faster not slower?

Thanks for any other tips or ideas.

EDIT: To answer the question what the plan looks like without the hints, this (and it takes forever):

no hints

It maybe also worth mentioning that there are almost 1M records in the one table and 8M in the other

1 comments

@Rob Farley 2015-05-06 21:38:36

The problem is that it might (and knowing spatial indexes, probably will) assume that the spatial filter will be a lot more selective than the time filter.

But if you have a few million records within 200km, then it could be significantly worse.

You're asking it to find records within 200km, which returns data ordered by some spatial order. Finding the records in there that are close in time means checking each one.

Or else you're finding records by time, and you're getting results in time order. Then, filtering this list to the 200km radius is a matter of checking each one.

If you filter the data in two ranges like this, it becomes hard to apply the second filter using an index. You may be better off telling it not to use the spatial index if the time filter is the tighter one.

If both are large individually, and it's only together that they are tight, then you have a more complex issue, one that people have tried to solve for a long time, and which could be nicely resolved by indexes that cover 3D (and beyond) space. Except that SQL Server doesn't have them.

Sorry.

Edit: more info...

This is a similar problem to finding time ranges that cover a particular point in time. When you search for the records that start before that point, you then have an unordered mess of end times - and vice-versa. If you look for people in the phone book whose surnames start with F, you can't hope to find the people whose first names start with R very easily. And an index on first name doesn't help either for the same reason. Finding things in that next index is hard when your first index is not an equality.

Now, if you could change your date filter into an equality filter (or series of equality filters), then you could stand a chance, except that a spatial index is a special kind of index and can't be used as the second level in a composite index.

So you're left with an awkward situation, I'm afraid. :(

Edit: Try:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
where h.GeoPoint.STDistance(m.GeoPoint)/1000.0 < 200
option( table hint ( h, index(ix_MJD) ) );

Notice that I'm deliberately breaking the sargability by dividing by 1000 before comparing to 200. I want this work to be done in the Key Lookup.

Mind you, you could avoid the need for the lookups (and the hints) by INCLUDEing GeoPoint and Time in both ix_MJD indexes. That'll certainly take some of the heat out of the query plan.

@user261963 2015-05-06 22:41:25

I don't know if it changes anything, but the time filter is a lot more selective.

@Rob Farley 2015-05-07 01:46:34

Ok. So is it acceptable to locate all the time-matched rows and then check each location without the index?

@Rob Farley 2015-05-07 01:48:13

...so then the plan looks like your original one, but has an extra predicate or filter.

@Rob Farley 2015-05-07 08:32:02

Suggested some changes with a quick edit. You don't need to hint about m, just h. Although if you can swap which one you're adding 1/8 to, to make sure you're modifying the column from the smaller table and using those values to seek into the larger one, that'll help too. If h is 8M and m is 1M, leave the BETWEEN predicate, and hint for just h. If it's the other way round, change your predicate and hint (but better than changing the hint is to add those columns to your index).

@user261963 2015-05-08 17:00:53

Taking out all table hints seems to work best in the end, so long as I do h between m and not the other way around. The query no longer uses the GeoPoint indexes at all, but it wasn't using them efficiently anyway. I included the GeoPoint column to the MJD index and that helped a lot. select top 10000 h.Time, m.Time, m.GeoPoint.STDistance(h.GeoPoint), h.mjd-m.mjd from L2V5.dbo.header h join L2.dbo.MLS_Header m on m.GeoPoint.STDistance(h.GeoPoint)<200000 and m.mjd between h.mjd-.125 and h.mjd+.125 order by h.mjd

@Rob Farley 2015-05-08 20:29:00

Terrific. If you could add mjd and time to the spatial index, you might find that it started using that instead, which is where I figured breaking the sargability would help too. But the most significant thing is removing the lookups. I'm glad it helped. :)

Related Questions

Sponsored Content

1 Answered Questions

2 Answered Questions

[SOLVED] How to optimize inner join query on one large table and one small table

  • 2019-11-05 15:18:01
  • Anthony
  • 37 View
  • 0 Score
  • 2 Answer
  • Tags:   postgresql index

4 Answered Questions

[SOLVED] How to optimize STDistance execution?

1 Answered Questions

1 Answered Questions

1 Answered Questions

[SOLVED] LEFT JOIN on BETWEEN not using indexes

1 Answered Questions

1 Answered Questions

[SOLVED] How to optimize a very slow query with joins and group by?

Sponsored Content