I am doing some proof-of-concept work using ES 0.90.7 to index spatial data that is currently stored in PostGIS. I have created a 6-node cluster with 16 GB of RAM and 16 CPUs on each node, with a 2-shard index, and have indexed 45M records into ES from PostGIS. My mapping is as follows (other details left out for brevity):
"mappings" : {
"well_heads" : {
"properties" : {
"location" : {
"type" : "geo_point",
"lat_lon" : "true",
"geohash" : "true",
"geohash_prefix" : "true",
"geohash_precision" : 10
}
}
}
}
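For reference, each record is indexed with the location as a geo_point. A single document looks roughly like this (the index name and document ID here are just illustrative, and the coordinates are the same ones used in the query below):

PUT /wells/well_heads/1
{
  "location" : {
    "lat" : 38.987,
    "lon" : 76.987
  }
}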
We do a lot of queries like 'give me all wellheads that are within 1 km of a school' and have created a simple ES query to answer this question:
{
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : {}
      },
      "filter" : {
        "geo_distance" : {
          "distance_type" : "plane",
          "distance" : "1km",
          "well_heads.location" : [76.987, 38.987]
        }
      }
    }
  }
}
This returned in about 30 seconds, which I was expecting to be much faster. I began experimenting with a geohash_cell filter, which returned in about 20 ms (cold) for a similar number of hits (~22k). I then combined the geohash_cell filter with a _geo_distance sort, which executed in 19 s; this made me think that the processing time may be due to the distance calculations. Is this correct, or am I missing something obvious? Even when I queried a location with a single hit, the time was the same. This made me think that perhaps ES is doing an unbounded query against all the data.
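For comparison, the geohash_cell experiment combined with the _geo_distance sort was roughly along these lines (the precision and neighbors values here are illustrative, not necessarily the exact ones I used):

{
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : {}
      },
      "filter" : {
        "geohash_cell" : {
          "well_heads.location" : [76.987, 38.987],
          "precision" : "1km",
          "neighbors" : true
        }
      }
    }
  },
  "sort" : [
    {
      "_geo_distance" : {
        "well_heads.location" : [76.987, 38.987],
        "order" : "asc",
        "unit" : "km"
      }
    }
  ]
}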
I would love to have ES outperform PG. FWIW, the PG server is a single machine that is half as beefy as the ES cluster, and it returns in 365 ms cold. I've played with different shard counts, replicas, etc.
Thanks in advance.
Regards,
Eric