Ah, okay. For my test case, the indexed bounding box optimization turns out
to be an order of magnitude slower than the simple in-memory check. What if
I add a geohash for each point -- is there a way to use that for bounding
box filtering?
On Tue, Mar 6, 2012 at 11:33 AM, Shay Banon email@example.com wrote:
The way optimize_bbox works is that it runs a Lucene range filter on the
lat and lon, intercepts it, and uses the result to filter out the documents
to use. This might be slower compared to simple in-memory checks; it really
depends on your usage. How many nodes do you have? 10 shards sounds like enough;
note that even a single shard can easily handle concurrent search requests.
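For reference, a geo_distance filter with the indexed bounding box optimization enabled looks roughly like this (the field name "pin.location", the distance, and the coordinates are placeholders, not from the thread):

```json
{
  "filtered": {
    "query": { "match_all": {} },
    "filter": {
      "geo_distance": {
        "distance": "50km",
        "optimize_bbox": "indexed",
        "pin.location": { "lat": 40.7, "lon": -74.0 }
      }
    }
  }
}
```

Setting "optimize_bbox" to "indexed" runs the preliminary bounding box check against the indexed .lat/.lon fields instead of doing it in memory; "memory" is the default and "none" disables the bbox check entirely.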
On Tuesday, March 6, 2012 at 10:18 AM, Nick Dimiduk wrote:
I do expect geo_distance to be the primary query pattern and it should be
executed exclusively as a filter. I upgraded my cluster from m1.large to
c1.xlarge instances and that helped a lot! After that, I rebuilt my index
(10 shards + 1 replica each) with lat_lon:true and tried out
the optimize_bbox = indexed setting. This caused latencies to skyrocket; I was
much better off without it enabled. The index is configured to run entirely in
memory; why is the indexed bbox so detrimental?
Tomorrow I'll try increasing my shard count even higher.
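A mapping with lat_lon enabled, as described above, looks roughly like this (the type name "place" and the field name "pin.location" are illustrative):

```json
{
  "mappings": {
    "place": {
      "properties": {
        "pin": {
          "properties": {
            "location": { "type": "geo_point", "lat_lon": true }
          }
        }
      }
    }
  }
}
```

With lat_lon set to true, the latitude and longitude are also indexed as separate numeric .lat and .lon fields, which is what the indexed bounding box check operates on.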
On Sat, Mar 3, 2012 at 1:43 PM, Shay Banon firstname.lastname@example.org wrote:
The way geo_distance executes by default is by doing in memory checks
to see if it fits or not (and, by default, optimizing to do a preliminary
bounding box check to reduce the heavy distance calc). The bounding box
check can also be done on indexed data rather than in memory, which might
make sense in your case (it mainly makes sense when the filter is the main
one executed, without any additional filtering going on). See the docs for
the geo_distance filter and the geo_point type for more.
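A minimal geo_distance filter of the default form sketched above would look something like this (field name, distance, and coordinates are illustrative); with no optimize_bbox set, the bounding box check runs in memory:

```json
{
  "filtered": {
    "query": { "match_all": {} },
    "filter": {
      "geo_distance": {
        "distance": "20km",
        "pin.location": { "lat": 40.7, "lon": -74.0 }
      }
    }
  }
}
```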
What I would suggest is to test; it's hard to give suggestions otherwise.
You will need two things, though. First, have enough shards to accommodate
the data (without knowing the specifics of the data, the default 5 will do),
and second, have large machines, since CPU is going to be a factor in the
geo case. The good thing is that you can always add machines if you need
more search power, and even if you get to 10 (5 shards + 1 replica each),
you can always dynamically increase the number of replicas (and machines)
to add more search capacity.
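Increasing the replica count dynamically is a single call to the update settings API, along these lines (the index name "myindex" is a placeholder):

```
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```

The new replicas are then allocated to available nodes, so adding machines first and bumping replicas second is the usual way to scale out search capacity.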
On Thursday, March 1, 2012 at 9:17 PM, Nick Dimiduk wrote:
I'm interested in using ElasticSearch as a distributed geospatial
index. I have roughly 2 million records, each with a single "pin". My
primary query pattern will be geo_distance, perhaps with filters on
other fields. The system will be deployed in EC2.
Let's say I want to support 8k-10k concurrent queries with reasonable
(sub-second) latencies. Can you give me some advice as to what a
properly provisioned cluster for supporting this load might look like?
How many nodes of what instance type? What kinds of shard-to-node-to-
replica ratios should I expect? Should I keep the indexes in memory or
is there benefit in leaving them on disk? Would such a cluster benefit
from node configuration tweaks (designating some as non-data nodes, for example)?
What are the questions I'm not asking which I should be asking? I'm
not expecting exact numbers, but orders-of-magnitude would be nice.
Thanks in advance,