Geospatial implementation

I've had a quick troll through the source on GitHub and found the code
for the GeoDistanceFilter but my familiarity with Lucene innards is
poor. My main concern is with large geo datasets in ES and how
performant that might be (or not). At a high level, how are the
geospatial filters implemented? Most specifically I'm interested in
the geo_distance and geo_distance_range filters.

Thanks,

Josh

Geo distance checks are done by loading all the geo points for the specific
field to memory, and doing in memory distance checks for all the documents
that matched your "main query". This means that with more results matching
the query, the more computational heavy it will become.

On Tue, Sep 13, 2011 at 12:31 PM, Josh Devins info@joshdevins.net wrote:

I've had a quick troll through the source on GitHub and found the code
for the GeoDistanceFilter but my familiarity with Lucene innards is
poor. My main concern is with large geo datasets in ES and how
performant that might be (or not). At a high level, how are the
geospatial filters implemented? Most specifically I'm interested in
the geo_distance and geo_distance_range filters.

Thanks,

Josh

So where there is no "main query", we essentially load up all
documents' geo points into memory and filter/do distance checks after
the real Lucene query has been fired. Take for example the following
query, which I assume because of "match_all" acts as described above.

"filtered": {
    "query": {
        "match_all": {}
    },
    "filter": {
        "geo_distance": {
            "distance": "5km",
            "place.location": {
                "lat": 10,
                "lon": 70
            }
        }
    }
}

All field values are loaded into memory and persisted there, regardless of
the query. And yes, on a match all query, it will be (in memory) checked
against all docs. In master, this will be faster (for distance) since it
will do bounding box checks first.

On Tue, Sep 13, 2011 at 1:16 PM, Josh Devins info@joshdevins.net wrote:

So where there is no "main query", we essentially load up all
documents' geo points into memory and filter/do distance checks after
the real Lucene query has been fired. Take for example the following
query, which I assume because of "match_all" acts as described above.

"filtered": {
"query": {
"match_all": {}
},
"filter": {
"geo_distance": {
"distance": "5km",
"place.location": {
"lat": 10,
"lon": 70
}
}
}
}

Great to know about the bounding-box checking on master. I was going
to just add a geohash that can be part of the main query to filter the
list down before going into memory and doing the distance checks. I
assume this would have the same/similar effect as what's on master?

The bounding box check is also done in memory, how were you thinking of
adding the geo hash based check? Using a term query / filter?

On Tue, Sep 13, 2011 at 1:41 PM, Josh Devins info@joshdevins.net wrote:

Great to know about the bounding-box checking on master. I was going
to just add a geohash that can be part of the main query to filter the
list down before going into memory and doing the distance checks. I
assume this would have the same/similar effect as what's on master?

I haven't played with it yet, but probably with a term query. This
should then return at least a subset of results rather than the entire
corpus, that need to be dealt with in memory. The geohash in the query
would be much shorter than the high-precision geohash derived from the
lat/lon stored in the index/with the document. This is basically a
bounding box search using Lucene to do the heavy lifting first.

At least, that was the plan! Happy to hear your thoughts.

On second thought, I think geohash might be more than necessary. You
mention that what is in master right now relies on first doing a
bounding box filter, then the range filter. Why not use a range query
on the minimum bounding box of the geo_distance radius to narrow this
down in Lucene first, then what you have in memory you can just apply
the regular geo_distance filter? Not sure about performance, but it
would then not require a pre-calculated geohash in the index.

We can definitely do that, its actually something that I want to do (once we
can derive the bounding box from the distance). Its the same question
between "range" filter and "numeric_range" filter. The first uses the index
data to do the range check, the second does only in memory range checks.
Some times the fist one is faster, and sometimes the second one is.

For most typical cases, where the geo functionality is driven by a main
query that already narrows down the search, the in memory checks will be
faster. It might even be faster in the match_all case (depends on the
distribution of data).

But, we can definitely have an option to do "indexed" based range checks for
bounding box (will require also indexing the lat/lon). Note, this will not
work with multiple locations per doc (unless nested mapping is used). Can
you open an issue?

On Tue, Sep 13, 2011 at 5:18 PM, Josh Devins info@joshdevins.net wrote:

On second thought, I think geohash might be more than necessary. You
mention that what is in master right now relies on first doing a
bounding box filter, then the range filter. Why not use a range query
on the minimum bounding box of the geo_distance radius to narrow this
down in Lucene first, then what you have in memory you can just apply
the regular geo_distance filter? Not sure about performance, but it
would then not require a pre-calculated geohash in the index.

Ok, went ahead and pushed support for using indexed lat/lon to filer geo
locations, depending on your dataset distribution, it might prove faster:
Allow to filter geo bounding box or distance based on indexed lat lon · Issue #1334 · elastic/elasticsearch · GitHub.

On Tue, Sep 13, 2011 at 11:26 PM, Shay Banon kimchy@gmail.com wrote:

We can definitely do that, its actually something that I want to do (once
we can derive the bounding box from the distance). Its the same question
between "range" filter and "numeric_range" filter. The first uses the index
data to do the range check, the second does only in memory range checks.
Some times the fist one is faster, and sometimes the second one is.

For most typical cases, where the geo functionality is driven by a main
query that already narrows down the search, the in memory checks will be
faster. It might even be faster in the match_all case (depends on the
distribution of data).

But, we can definitely have an option to do "indexed" based range checks
for bounding box (will require also indexing the lat/lon). Note, this will
not work with multiple locations per doc (unless nested mapping is used).
Can you open an issue?

On Tue, Sep 13, 2011 at 5:18 PM, Josh Devins info@joshdevins.net wrote:

On second thought, I think geohash might be more than necessary. You
mention that what is in master right now relies on first doing a
bounding box filter, then the range filter. Why not use a range query
on the minimum bounding box of the geo_distance radius to narrow this
down in Lucene first, then what you have in memory you can just apply
the regular geo_distance filter? Not sure about performance, but it
would then not require a pre-calculated geohash in the index.

That's great! Will have a look soon. Thanks for the quick fix!

Josh

That's great! Will have a look soon. Thanks for the quick fix!

Josh

I'm wondering how spatial search works with EleasticSearch. This old
contrib/plugin for early Solr worked VERY well:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201012.mbox/<AANLkTi=8poOPe1V9zJcwWbDQcYNyz4bG2U6ysa+ru=xU@mail.gmail.com>
http://info.dutchworks.nl/sspcontentrequestdw.html

It created dynamic fields during indexing that allowed the bounding box /
radius searches to be VERY fast. It ran a polygonal analysis, I think,
against the geolocation, and put the results in the dynamic fields.

On Wednesday, 14 September 2011 11:29:13 UTC-5, Josh Devins wrote:

That's great! Will have a look soon. Thanks for the quick fix!

Josh

I glanced over the implementation are one point, in elasticsearch, you have both available options, either do in memory bounding box checks, or indexed bounding box checks. Distance checks were always done in memory, though it optimizes the heavy distance calculation by first checking if it falls in a bounding box.

On Sunday, March 4, 2012 at 8:53 PM, Dennis wrote:

I'm wondering how spatial search works with EleasticSearch. This old contrib/plugin for early Solr worked VERY well:

 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201012.mbox/%3CAANLkTi=8poOPe1V9zJcwWbDQcYNyz4bG2U6ysa+ru=xU@mail.gmail.com%3E
 http://info.dutchworks.nl/sspcontentrequestdw.html

It created dynamic fields during indexing that allowed the bounding box / radius searches to be VERY fast. It ran a polygonal analysis, I think, against the geolocation, and put the results in the dynamic fields.

On Wednesday, 14 September 2011 11:29:13 UTC-5, Josh Devins wrote:

That's great! Will have a look soon. Thanks for the quick fix!

Josh