Thursday, October 23, 2008

Nearest Neighbour in SQL Server 2005

Actually this would work for any database with a little adjustment to the SQL.

It's all very well having a lot of points and the distances to them, but what about finding which of those points we're actually close to? There's a bunch of algorithms which mostly involve running through a massive list of points until you've worked out which is the shortest, by applying Pythagoras' theorem to the difference between the points. This is the most common, and can be reached by simply reprojecting the map into a grid and selecting the result of the still squared phythagorean equation, then ordering by that column:

SELECT TOP 1 NodeID, POWER(Northing - <actualNorthing>, 2) + POWER(Easting - <actualEasting>, 2) AS result FROM NodeTable ORDER BY result;

We're in the UK so we've used OSGB as our grid for the time being. There's plenty of other grid systems available such as UTM (Universal Transverse Mercator) and various other grids for other countries or continents. As it is, we'd have to import new data into our database if we did ship out to another country so the issue probably won't arise for us until we have someone who wants to work across a country border. Watch this space - we have Ireland to tackle yet :)

There is immediately an obvious optimisation here. England is very densely packed, and it's unlikely that you'll ever be on a road and be further than 2.5km from the next junction. (I actually ran a quick query on our data and the longest road we've got is just under 5km, but that may well be a bendy road.) So, with an index on Northing and Easting, we can put a bounding box around our actual point:

SELECT TOP 1 NodeID, POWER(Northing - <actualNorthing>, 2) + POWER(Easting - ,2) AS result FROM NodeTable WHERE Northing > <actualnorthing> - 1250 AND Northing < <actualnorthing> + 1250 AND Easting > <actualEasting> - 1250 AND Easting < <actualEasting> + 1250 ORDER BY result;

(As an aside, to make this more generic we could easily select the max roadlength if we have that data available - we do, so it's something we may look at in the future.)

Now we have the squared distance to the nearest node and that node's ID. We can, of course, then make this select into a subselect and in the outer, calculate the square root of the distance. Then we know roughly how far the actual point is from the node we've selected.

There are other optimisations for this algorithm. One involves starting with the bounding box, counting the points, taking the standard deviation and creating a new bounding box, then iterating through this until one remains. A short paper can be read here (PDF format). Another takes a small box and enlarges it until candidate nodes are found, then uses Pythagoras' theorem again. There is an MSDN blog I found after writing my SQL Server 2005 solution, which was funnily enough only published today: Linkage here. This one is optimised for SQL Server 2008 Spatial, which I'd love to get my teeth into.

No comments: