Geo-Distance Throughput Expectations

Hi all,

I'm interested in using Elasticsearch as a distributed geospatial
index. I have roughly 2 million records, each with a single "pin". My
primary query pattern will be geo_distance, perhaps with filters on
other fields. I will be deploying in EC2.

Let's say I want to support 8k-10k concurrent queries with reasonable
(sub-second) latencies. Can you give me some advice as to what a
properly provisioned cluster for supporting this load might look like?
How many nodes of what instance type? What kinds of shard-to-node-to-
replica ratios should I expect? Should I keep the indexes in memory or
is there benefit in leaving them on disk? Would such a cluster benefit
from node configuration tweaks (designating some as non-data nodes,
for instance)?

What are the questions I'm not asking which I should be asking? I'm
not expecting exact numbers, but orders-of-magnitude would be nice.

Thanks in advance,
-n

Hi,

By default, geo_distance executes by doing in-memory checks to see whether each point falls within the distance (and, by default, it optimizes by running a preliminary bounding box check first to cut down on the heavy distance calculation). The bounding box check can also be done against indexed data rather than in memory, which might make sense in your case (it mainly makes sense when the geo filter is the main one being executed, without additional filtering going on). See the docs for the geo_distance filter and the geo_point type for more.
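To make that concrete, here is a minimal sketch in Python of the kind of query being discussed (the index name "places", the field name "pin", and the coordinates are made up; the filtered-query form shown is the DSL of that era, and optimize_bbox accepts "memory", "indexed", or "none"):

# Minimal sketch (not from the thread): a match_all query filtered by
# geo_distance. Index name "places" and geo_point field "pin" are made up.
import requests

query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "pin": {"lat": 40.73, "lon": -74.00},
                    "optimize_bbox": "memory",  # or "indexed" / "none"
                }
            },
        }
    }
}

resp = requests.post("http://localhost:9200/places/_search", json=query)
print(resp.json()["hits"]["total"])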

What I would suggest is to test; it's hard to give concrete numbers. You will need two things, though. First, have enough shards to accommodate the data (without knowing the specifics of the data, the default 5 will do), and second, use large machines, since CPU is going to be a factor in the geo case. The good thing is that you can always add machines if you need more search power, and even once you get to 10 (5 shards + 1 replica each), you can dynamically increase the number of replicas (and machines) to add more search capacity.
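As a sketch of the "dynamically increase the number of replicas" part, the index settings API can bump the replica count on a live index (the index name below is made up):

# Sketch: raise the replica count of an existing index at runtime so the new
# copies can be allocated to freshly added nodes. Index name "places" is made up.
import requests

settings = {"index": {"number_of_replicas": 2}}
resp = requests.put("http://localhost:9200/places/_settings", json=settings)
print(resp.json())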

On Thursday, March 1, 2012 at 9:17 PM, Nick Dimiduk wrote:

So the filters are applied during dataset retrieval, before the result set is in memory?

If I've got machines all over the world, and each of them is geo-centric (as well as language/locale-centric), is there any way to make the sharding/clustering transparent from any machine in the world for the site, while data and searches automatically get directed only to the machines/clusters that hold the logical geographic location and locale where the data would be stored?

It's recommended to have a cluster per location, since latency between machines in different locations is going to be a factor.
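A hypothetical sketch of what "a cluster per location" can look like from the application side (all endpoints, region keys, and names below are invented): each region runs its own independent cluster, and the application routes a search to the cluster serving the caller's region rather than stretching one cluster across high-latency links.

# Hypothetical routing sketch: one independent Elasticsearch cluster per region,
# with the application picking the endpoint for the caller's region.
import requests

REGIONAL_CLUSTERS = {
    "us-east": "http://es-us-east.example.com:9200",
    "eu-west": "http://es-eu-west.example.com:9200",
    "ap-southeast": "http://es-ap-southeast.example.com:9200",
}

def regional_search(region, index, query):
    base_url = REGIONAL_CLUSTERS[region]
    resp = requests.post("%s/%s/_search" % (base_url, index), json=query)
    return resp.json()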

On Sunday, March 4, 2012 at 9:01 PM, gearond wrote:

Thanks Shay.

I do expect geo_distance to be the primary query pattern, and it should be
executed exclusively as a filter. I upgraded my cluster from m1.large to
c1.xlarge instances and that helped a lot! After that, I rebuilt my index
(10 shards + 1 replica each) with lat_lon: true and tried out
optimize_bbox = indexed. This caused latencies to skyrocket; I was much
better off without that option enabled. The index is configured to run
entirely in memory, so why is the indexed bbox so detrimental?
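For context, a mapping along these lines is what enables lat_lon and the indexed bounding box check, since latitude and longitude get indexed as separate numeric fields (the index, type, and field names below are illustrative, not the poster's actual setup):

# Illustrative index creation: 10 shards, 1 replica, and a geo_point field with
# lat_lon enabled so lat and lon are also indexed as standalone numeric fields
# (which is what optimize_bbox = "indexed" relies on). Names are made up.
import requests

index_body = {
    "settings": {"index": {"number_of_shards": 10, "number_of_replicas": 1}},
    "mappings": {
        "record": {
            "properties": {
                "pin": {"type": "geo_point", "lat_lon": True}
            }
        }
    },
}

resp = requests.put("http://localhost:9200/places", json=index_body)
print(resp.json())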

Tomorrow I'll try increasing my shard count even higher.

Thanks,
-n

On Sat, Mar 3, 2012 at 1:43 PM, Shay Banon kimchy@gmail.com wrote:

The way optimize_bbox works when set to indexed is that it runs a Lucene range filter on the indexed lat and lon values, intersects the results, and uses that to narrow down which documents get the full distance check. This might be slower than the simple in-memory checks; it really depends on your usage. How many nodes do you have? 10 shards sounds like enough; note that even a single shard can easily handle concurrent search requests.
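In the spirit of "test it", a quick-and-dirty way to compare the settings is to time the same filter under each optimize_bbox mode. This is only a rough sketch (index and field names are made up, and the numbers are only meaningful after warming caches and running plenty of iterations):

# Rough timing sketch: run the same geo_distance filter with each optimize_bbox
# setting and report the average request time. Not a rigorous benchmark.
import time
import requests

def avg_query_time(optimize_bbox, iterations=100):
    query = {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": "10km",
                        "pin": {"lat": 40.73, "lon": -74.00},
                        "optimize_bbox": optimize_bbox,
                    }
                },
            }
        }
    }
    start = time.time()
    for _ in range(iterations):
        requests.post("http://localhost:9200/places/_search", json=query)
    return (time.time() - start) / iterations

for mode in ("memory", "indexed", "none"):
    print("%s: %.1f ms" % (mode, avg_query_time(mode) * 1000.0))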

On Tuesday, March 6, 2012 at 10:18 AM, Nick Dimiduk wrote:

So does it handle cases where either of the poles is inside the bounding box?

Does it subdivide the box into several trapezoids as it gets closer to the poles?

Ah, okay. For my test case, the indexed bounding box optimization turns out
to be an order of magnitude slower than the simple in-memory check. What if
I add a geohash for each point -- is there a way to use that for the
bounding box optimization instead?

Thanks,
-n

On Tue, Mar 6, 2012 at 11:33 AM, Shay Banon kimchy@gmail.com wrote:
