Time Data: Giant Index w/Shard Routing vs. Small Indices w/Little Shards and Aliasing


(webish) #1

I am attempting to optimize time-based data such as that of a newsfeed.

I've been running tests with data broken into indices by month, week, and
day. I'm using aliases to query the entire set or smaller ranges such as
"last-month" and "last-quarter".

I'm still trying to figure out what the ideal setup will be. Under
concurrency, the API that queries ES is experiencing linear growth in
response time. Even at 35 concurrent users with a single app server, the
response times are somewhere in the neighborhood of 8 seconds.

Has anybody experimented with just using a single index with a very large
number of shards rather than a large number of indices with a small number
of shards?

What are the trade-offs of using one vs. the other?
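
For readers following along, here is a minimal sketch of the alias setup described above, assuming monthly indices. The `newsfeed-YYYY.MM` naming scheme and both helper functions are hypothetical and not taken from the original tests. The only Elasticsearch detail relied on is the `_aliases` endpoint, which accepts a JSON body containing a list of `add` actions.

```python
from datetime import date


def monthly_index_names(prefix, end, months):
    """Return index names such as 'newsfeed-2014.05' for the `months`
    calendar months ending with the month that contains `end`.
    The naming scheme is an assumption made for this sketch."""
    names = []
    year, month = end.year, end.month
    for _ in range(months):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return names


def alias_actions(alias, indices):
    """Build a request body for the _aliases endpoint that adds
    `alias` to every index in `indices`."""
    return {"actions": [{"add": {"index": name, "alias": alias}}
                        for name in indices]}


last_quarter = monthly_index_names("newsfeed", date(2014, 5, 14), 3)
body = alias_actions("last-quarter", last_quarter)
print(body)
```

The resulting body would be sent as a POST to `/_aliases`. A rolling alias such as "last-quarter" also needs a matching `remove` action for the index that falls out of the window, which this sketch leaves out.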

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2b105c60-4e7c-4ee4-98c7-76b04fe18b63%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

Sharding is good when you have multiple nodes: that way you have a
small number of shards per node that can be queried in parallel, rather
than one (or a few) queried sequentially. However, you will get similar
results by having many smaller indexes across multiple nodes. The key thing
with both approaches is spreading the query load across nodes.

If you are starting to go down the path of shard/index count comparison,
it's probably worth looking at increasing the number of nodes; you'll get
much better performance improvements.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

(In reply to webish, 14 May 2014 14:52.)



(webish) #3

OK, that makes sense.

I'd like to set up an indexing strategy for time data that will hold for
some time without needing to reshuffle everything.

The advantages I've found with small indices and few shards are that there
is no fixed cap on the number of shards, and that aliasing strategies have
more power than basic shard routing. This strategy can also use a form of
shard routing. The disadvantages are that a large number of open indices
adds overhead in a small cluster, and that managing the indices takes more
work than routing to specific shards.

Both can have issues with hotspots. It's definitely easier to increase the
shard count on newer indices using the small-index strategy: you re-index a
single small index rather than a large one with many shards, or you just
give newer indices more shards as the system grows.

Both solutions also have to ensure that the thread count doesn't exceed
system capacity (which depends on the number of shards, replicas, indices,
cores, nodes, and the level of concurrency).
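
As a rough illustration of that last point, the sketch below estimates how many shard-level requests are in flight for each layout. It assumes each search touches one copy (primary or replica) of every shard in the indices it targets. Apart from the 35 concurrent users mentioned earlier in the thread, every number here (index count, shard counts, node count, search threads per node) is a made-up assumption, and the function names are hypothetical.

```python
def shard_requests_per_search(indices_hit, primary_shards_per_index):
    """One search fans out to one copy of every shard in the indices it hits."""
    return indices_hit * primary_shards_per_index


def in_flight_shard_requests(concurrent_searches, indices_hit,
                             primary_shards_per_index):
    """Shard-level requests outstanding if all searches run at once."""
    return concurrent_searches * shard_requests_per_search(
        indices_hit, primary_shards_per_index)


def search_capacity(nodes, search_threads_per_node):
    """Shard-level requests the cluster can execute at the same time.
    The per-node thread count is an input, not a claimed default."""
    return nodes * search_threads_per_node


concurrent = 35  # concurrent users, from the original question

# Assumed layouts: 12 monthly indices with 2 shards each,
# versus a single index with 24 shards.
many_small = in_flight_shard_requests(concurrent, indices_hit=12,
                                      primary_shards_per_index=2)
one_large = in_flight_shard_requests(concurrent, indices_hit=1,
                                     primary_shards_per_index=24)
capacity = search_capacity(nodes=3, search_threads_per_node=12)

print(many_small, one_large, capacity)  # prints "840 840 36"
```

Under these assumptions both layouts produce the same fan-out for a query over the full data set, and it is far above what the cluster can run at once. The small-index layout only does less work when an alias narrows the search to fewer indices.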

(In reply to Mark Walkom, Wednesday, May 14, 2014 1:00:09 AM UTC-4.)


