Spatial Similarity

loff · November 6, 2017, 10:02am

Hi,

I was wondering if there are any plugins / methods for calculating the similarity of spatial features. I'm planning to build a recommendation system for spatial data. The data I have consists of points, polygons and linestrings.

I know there is the geo_shape query, and I think the intersects relation is a good start to find related documents. However, there are cases where e.g. two features are very close but do not intersect. Additionally, I would like to have some sort of ranking.
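One possible workaround for the "close but not intersecting" case is to grow the query shape by a margin before running the intersects query. The snippet below is only a sketch: it builds the request body for a geo_shape query with an envelope, and the field name `geometry`, the helper name and the `margin_deg` parameter are assumptions made for illustration, not something from an existing mapping.

```python
import json

def buffered_intersects_query(min_lon, min_lat, max_lon, max_lat,
                              margin_deg=0.0, field="geometry"):
    """Build a geo_shape 'intersects' query body for a padded bounding box.

    'field' is a hypothetical geo_shape field name; 'margin_deg' is a crude
    buffer in degrees, so it is not a true distance on the ground.
    """
    # Envelope coordinates are [[min_lon, max_lat], [max_lon, min_lat]],
    # i.e. upper-left corner first, then lower-right corner.
    envelope = [
        [max(min_lon - margin_deg, -180.0), min(max_lat + margin_deg, 90.0)],
        [min(max_lon + margin_deg, 180.0), max(min_lat - margin_deg, -90.0)],
    ]
    return {
        "query": {
            "geo_shape": {
                field: {
                    "shape": {"type": "envelope", "coordinates": envelope},
                    "relation": "intersects",
                }
            }
        }
    }

if __name__ == "__main__":
    body = buffered_intersects_query(-2.0, 50.0, -1.0, 51.0, margin_deg=0.5)
    print(json.dumps(body, indent=2))
```

This only widens the set of matches. It does not by itself give the ranking asked about, since a match on a padded envelope says nothing about how close or how similar two features are.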

The geohash aggregation is also interesting, since features that share a geohash prefix fall into the same bucket. But there are also edge cases where two features are related but not in the same bucket:

Nearby locations generally have similar prefixes, though not always: there are edge-cases straddling large-cell boundaries; in France, La Roche-Chalais (u000) is just 30km from Pomerol (ezzz). A reliable prefix search for proximate locations will also search prefixes of a cell’s 8 neighbours. (e.g. a database query for results within 30-odd kilometres of Pomerol would be SELECT * FROM MyTable WHERE LEFT(Geohash, 4) IN ('ezzz', 'gbpb', 'u000', 'spbp', 'spbn', 'ezzy', 'ezzw', 'ezzx', 'gbp8')). Whether this would offer significant (or any) performance gains over a latitude/longitude bounding box search I’ve yet to check.

From http://www.movable-type.co.uk/scripts/geohash.html
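To make the neighbour lookup from the quote concrete, here is a small self-contained sketch of geohash encoding plus a way to list the 8 cells around a given cell. It is not taken from the linked page: the function names are made up for illustration, and the neighbours are found by simply re-encoding points one cell width away from the cell centre, which ignores the special cases at the poles.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def encode(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, value, bits, even = [], 0, 0, True  # even bits refine longitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + int(bit)
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, bits = 0, 0
    return "".join(chars)

def cell_bounds(geohash):
    """Return (lat_lo, lat_hi, lon_lo, lon_hi) of a geohash cell."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    even = True
    for ch in geohash:
        idx = BASE32.index(ch)
        for shift in (4, 3, 2, 1, 0):
            bit = (idx >> shift) & 1
            if even:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            even = not even
    return lat_lo, lat_hi, lon_lo, lon_hi

def neighbours(geohash):
    """Return the set of cells surrounding a geohash cell (8 away from the poles)."""
    lat_lo, lat_hi, lon_lo, lon_hi = cell_bounds(geohash)
    lat_c, lon_c = (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
    dlat, dlon = lat_hi - lat_lo, lon_hi - lon_lo
    found = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            lat = lat_c + dy * dlat
            if not -90.0 < lat < 90.0:
                continue  # no cell beyond the poles
            lon = (lon_c + dx * dlon + 180.0) % 360.0 - 180.0  # wrap at the antimeridian
            found.add(encode(lat, lon, len(geohash)))
    return found

if __name__ == "__main__":
    print(sorted(neighbours("ezzz")))
```

Searching for the cell's own prefix together with the prefixes returned by `neighbours` corresponds to the IN list in the quoted query.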

Best,
lukas

Mark_Harwood · November 10, 2017, 3:48pm

I've used a custom analyzer with Lucene before now that used JTS and Geohash to describe a shape using a fixed number of geohashes (e.g. 100).
The encoding approach is to start with the top-level geohashes that intersect the shape, e.g. e and f, and then drop down to the second-level geohashes, e.g. ea and fz. With a budget of 100 and those four spent, you have 96 geohashes left to continue this process. In the final stages of the process you have more cells to describe at the next level than you have geohashes left to spend on them. At this point you can just spread the remaining geohashes evenly across the cells still to be described at that level.
This means you have scale-dependent detailing of shape coverage - you spend more geohashes describing where there is more detail. You do the same thing for docs and queries and Lucene's default ranking algos do the rest.

Doc shapes that share many geohashes with query shapes are ranked higher (coord scoring).
Detailed geohashes are more important matches than upper-level geohashes (IDF scoring).

The tendency is to pull in scale-similar shapes that overlap. I used it as a Google Earth network plugin that pulled in zoom-dependent shapes e.g. the Isle of Wight isn't loaded until you zoom in tight over the UK.
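A rough sketch of the budgeted encoding described above might look like the following. It is an illustration under simplifying assumptions rather than the original analyzer: shapes are reduced to bounding boxes instead of JTS geometries, the names `describe`, `score`, `budget` and `max_precision` are invented here, and `score` is a crude stand-in for Lucene's ranking (shared tokens count, longer tokens count more), not its actual formula.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def cell_bounds(geohash):
    """Return (lat_lo, lat_hi, lon_lo, lon_hi) of a geohash cell."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    even = True  # even bits refine longitude, odd bits latitude
    for ch in geohash:
        idx = BASE32.index(ch)
        for shift in (4, 3, 2, 1, 0):
            bit = (idx >> shift) & 1
            if even:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            even = not even
    return lat_lo, lat_hi, lon_lo, lon_hi

def intersects(cell, box):
    """True if two (lat_lo, lat_hi, lon_lo, lon_hi) rectangles overlap."""
    return (cell[0] < box[1] and box[0] < cell[1]
            and cell[2] < box[3] and box[2] < cell[3])

def describe(box, budget=100, max_precision=8):
    """Describe a bounding box with at most 'budget' geohash tokens."""
    tokens, frontier = [], [""]
    for _ in range(max_precision):
        # Children of the current level that still touch the shape.
        candidates = [parent + ch for parent in frontier for ch in BASE32
                      if intersects(cell_bounds(parent + ch), box)]
        remaining = budget - len(tokens)
        if not candidates or remaining <= 0:
            break
        if len(candidates) > remaining:
            # Not enough budget for the whole level: spread what is left evenly.
            step = len(candidates) / remaining
            tokens.extend(candidates[int(i * step)] for i in range(remaining))
            break
        tokens.extend(candidates)
        frontier = candidates
    return tokens

def score(query_tokens, doc_tokens):
    """Toy relevance: shared tokens add up, deeper (rarer) tokens weigh more."""
    shared = set(query_tokens) & set(doc_tokens)
    return sum(len(token) for token in shared)

if __name__ == "__main__":
    query = (50.5, 50.8, -1.6, -1.0)    # (min_lat, max_lat, min_lon, max_lon)
    nearby = (50.6, 50.9, -1.5, -0.9)
    far_away = (40.0, 41.0, 10.0, 11.0)
    q = describe(query, budget=20)
    print(q)
    print(score(q, describe(nearby, budget=20)), score(q, describe(far_away, budget=20)))
```

In a real index the tokens would be emitted into a text field for both documents and queries, so that the engine's own scoring takes the place of the `score` function.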
