Friday, October 24, 2008

Pythagoras and the OSGB coordinate system

We've decided that, since sinϑ is very close to ϑ for very small values of ϑ, and we're only ever dealing with very small values, we'll simply use Pythagoras with the Latitude & Longitude. Which avoids any problems with operating internationally. That is, in areas within other countries where the areas are small enough for the accuracy lost to be immeasurable and/or irrelevant. We're not routeplanning or performing highly critical analasys, after all.

Google Maps WTF Moment

What the hell is this thing? There appears to be a sign beside it. Searching for ["Guys head" crater] did not yield the results I expected...

Thursday, October 23, 2008

Nearest Neighbour in SQL Server 2005

Actually this would work for any database with a little adjustment to the SQL.

It's all very well having a lot of points and the distances to them, but what about finding which of those points we're actually close to? There's a bunch of algorithms which mostly involve running through a massive list of points until you've worked out which is the shortest, by applying Pythagoras' theorem to the difference between the points. This is the most common, and can be reached by simply reprojecting the map into a grid and selecting the result of the still squared phythagorean equation, then ordering by that column:

SELECT TOP 1 NodeID, POWER(Northing - <actualNorthing>, 2) + POWER(Easting - <actualEasting>, 2) AS result FROM NodeTable ORDER BY result;

We're in the UK so we've used OSGB as our grid for the time being. There's plenty of other grid systems available such as UTM (Universal Transverse Mercator) and various other grids for other countries or continents. As it is, we'd have to import new data into our database if we did ship out to another country so the issue probably won't arise for us until we have someone who wants to work across a country border. Watch this space - we have Ireland to tackle yet :)

There is immediately an obvious optimisation here. England is very densely packed, and it's unlikely that you'll ever be on a road and be further than 2.5km from the next junction. (I actually ran a quick query on our data and the longest road we've got is just under 5km, but that may well be a bendy road.) So, with an index on Northing and Easting, we can put a bounding box around our actual point:

SELECT TOP 1 NodeID, POWER(Northing - <actualNorthing>, 2) + POWER(Easting - ,2) AS result FROM NodeTable WHERE Northing > <actualnorthing> - 1250 AND Northing < <actualnorthing> + 1250 AND Easting > <actualEasting> - 1250 AND Easting < <actualEasting> + 1250 ORDER BY result;

(As an aside, to make this more generic we could easily select the max roadlength if we have that data available - we do, so it's something we may look at in the future.)

Now we have the squared distance to the nearest node and that node's ID. We can, of course, then make this select into a subselect and in the outer, calculate the square root of the distance. Then we know roughly how far the actual point is from the node we've selected.

There are other optimisations for this algorithm. One involves starting with the bounding box, counting the points, taking the standard deviation and creating a new bounding box, then iterating through this until one remains. A short paper can be read here (PDF format). Another takes a small box and enlarges it until candidate nodes are found, then uses Pythagoras' theorem again. There is an MSDN blog I found after writing my SQL Server 2005 solution, which was funnily enough only published today: Linkage here. This one is optimised for SQL Server 2008 Spatial, which I'd love to get my teeth into.

Change of plan...

Well, our really good idea was to use agents to roam the network and build a massive grid of Point to Point distances. After some thought, we abandoned that and learnt about Dijkstra's algorithm. Then we wrote our own implementation, which in C# on a decent box takes about 2500ms to find the distance (or time) from one point in Southwest England to every other point in the same area. This is absolutely perfect - we can cache the data against each static location and determine the approximate time to reach it from anywhere. If we make the assumption that we don't ever want to consider drivers further than 20 miles away, we're probably looking at a 10ms hit when we come to use it in reality.

Friday, October 17, 2008

Geographic routing & speed optimisation

Up until now we've been using MapPoint to get routes & distances between places, the data from which we use to create prices based on tariffs which can be fairly complex. Unfortunately making calls to MapPoint can be incredibly slow, and we only have MapPoint 2007 which means it won't contain any roads built in the past 18 months or so. I'm not sure there is a MapPoint 2008, but even if there is, we don't want it any more. In terms of licensing, it's not viable for us to go out to a web service as we will be making hundreds of thousands of routing queries each day and companies who supply such services charge by the hit, or charge huge sums based on the type of business. Ours, unfortunately, seems to come into the category of Asset Tracking which demands truly extortionate sums for a business like ours. We're also currently paying quite a lot for the Postcode Area File from the Post Office, which would need to be licensed on a per-site basis.

So we've got hold of the UK data from TeleAtlas which has a royalty-free license and we're now working on presenting the map data and creating routes. Address finding will come later, but it should be fairly simple.

Finding the shortest path is a very well documented problem. There are also plenty of very well documented algorithms and we may actually be looking at the Canadial Traveller Problem, where we know the position we want to get to, but we don't know how to get there and we don't know if all the roads on the route are open at that time.

However, these algorithms and problems are all approached on a per-instance basis. In general you're expected to want to find the length of the shortest path at the time that you want the information.

We've decided to take a different approach to the problem which involves mapping (in a grid) the time and distance from each node in our graph to every other node. The UK has a huge number of nodes, however, so the graph we map will be based on the bounding box around a local area with some further allowance for roads leading into, out of and around the area.

This does at first sound like a very time-intensive task, but we're planning to analyse the graph from multiple machines at once. We have SQL Server 2008, so having multiple concurrent connections won't be a problem.

I'll post more details on our method at a later date. Right now I have code to write :)

Thursday, October 9, 2008

Thought it was time I started a "public" blog, so here it is :) Not entirely sure what I'll post here - it just won't be rants or private things. Which probably doesn't leave much!