Modeling index for aggregation performance

Justin_Warkentin · February 25, 2015, 4:18pm

I'm a bit stuck trying to figure out the best way to model my indices for
aggregations. I'm currently storing article hits in indices that roll over
each month. Each index tends to have around 60M records. However, I have
two concerns:

1. In the future I expect the number of indices will grow into the
hundreds. If I'm trying to aggregate the total number of hits or the hits
per month of an article across the many indices, will the query end up
getting very slow since it has to aggregate across them all? Would it be
better to store all the hits for an article in the same index and use a new
index for blocks of article IDs instead of a new index per month to make
the index predictable for a certain article?
2. What about when I want to see what the top 10 articles of all time are?
This would require doing an aggregation of all articles across all indices,
right? How slow will that get when there are hundreds of indices with 60M+
records per index? Would it be better to store a hit counter on the article
record itself that gets updated occasionally?

Is there a better way to model the indices that would accommodate both of
these use cases?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2758357b-dadf-4097-917a-0cc54ca2109e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jpountz · February 26, 2015, 11:31pm

For 1., storing all hits for an article in the same index would not help
performance. Also note that you mentioned numbers of indices, but what
really matters to Elasticsearch is the total number of primary shards.
Having 60 indices with 1 shard each is much better for Elasticsearch than
one index with 1000 shards.

Would it be better to store a hit counter on the article record itself
that gets updated occasionally?

I think this is an option that you should consider indeed. Running a query
with a sort is much more efficient than running a terms aggregation to
compute the article that has most hits.
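
To make the comparison concrete, here is a minimal sketch of the two request bodies, written as plain Python dicts. The field names `article_id` and `hit_count` are assumptions for illustration and do not come from this thread; the first body would run against the hit indices, the second against whatever index holds the article records.

```python
import json


def top_articles_by_terms_agg(n=10):
    """Body for a terms aggregation over hit documents (one bucket per article)."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "top_articles": {
                # "article_id" is a hypothetical field on each hit document
                "terms": {"field": "article_id", "size": n}
            }
        },
    }


def top_articles_by_counter(n=10):
    """Body for a plain search sorted on a counter stored on each article."""
    return {
        "size": n,
        # "hit_count" is a hypothetical counter field on the article record
        "sort": [{"hit_count": {"order": "desc"}}],
    }


if __name__ == "__main__":
    print(json.dumps(top_articles_by_terms_agg(), indent=2))
    print(json.dumps(top_articles_by_counter(), indent=2))
```

The sorted search only has to read one counter value per article, while the aggregation has to visit every hit document in every shard it touches.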

--
Adrien Grand

Justin_Warkentin · February 27, 2015, 5:39am

How can I learn more about the overhead of shards? As I understand it, the
more documents there are per shard, the slower query performance is. If
there is only one shard per index then there is no parallelization of the
query. Right now I have an average of 60 million records per month going
into an index with the default of 5 shards and I'm creating a new index
each month. That means there's an average of about 12 million records per
shard. Perhaps I should decrease the number of shards, but I'm concerned
about query performance if I get too many documents per shard. With this
scheme that would leave me with 60 new shards per year. Of course, I have 4
data nodes and I'm planning to expand this further so they don't all have
to be on one node.

I guess what I'd really like to understand then is, what's a reasonable
number of shards per node? What's the overhead of a new active primary
shard? I can continue to spin up servers and scale horizontally by adding
new nodes to handle the shards; I just need a good way to gauge
when to add a node to the cluster. Also, I imagine if I'm searching across
all indices it will wreck query performance just because at some point
there will be so many that the parallelization itself introduces too much
overhead. Hence, the idea of limiting article hits to a subset of indices
based on both article id and date to prevent queries from having to touch
everything as the data grows.
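
The date half of that idea can be sketched as follows: given a date range, list only the monthly indices that can contain matching hits. The `hits-YYYY.MM` naming scheme and the `monthly_indices` helper are hypothetical, since the thread does not say how the indices are named.

```python
from datetime import date


def monthly_indices(start, end, prefix="hits-"):
    """Return one index name per calendar month from start to end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # The "<prefix>YYYY.MM" pattern is an assumed naming convention
        names.append(f"{prefix}{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    names = monthly_indices(date(2014, 11, 15), date(2015, 2, 1))
    # Elasticsearch accepts a comma-separated list of indices in the search path
    print(",".join(names))
```

A query for a bounded date range then touches only the shards of the listed months instead of every index in the cluster.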

jpountz · February 27, 2015, 11:36pm

The overhead of shards boils down to the fact that shards are Lucene
indices. There is no exact number for the right number of shards per node,
although 3 is certainly OK and 100 is probably too many. It's true that
today each shard is processed in a single thread, so having fewer, larger
shards might increase latency a bit (but it does not hurt throughput).
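
As a rough way to relate those numbers to the setup described in this thread, the sketch below estimates primary shards per node. The figure of 200 monthly indices is an assumed stand-in for "hundreds", replicas are ignored, and `primary_shards_per_node` is a hypothetical helper, not an Elasticsearch API.

```python
import math


def primary_shards_per_node(num_indices, shards_per_index, num_nodes):
    """Estimate primary shards on the busiest node, assuming an even spread."""
    total_primaries = num_indices * shards_per_index
    return math.ceil(total_primaries / num_nodes)


if __name__ == "__main__":
    nodes = 4      # data nodes mentioned in the thread
    indices = 200  # assumed value standing in for "hundreds" of monthly indices
    for shards in (5, 1):
        per_node = primary_shards_per_node(indices, shards, nodes)
        print(f"{shards} shard(s) per index: about {per_node} primaries per node")
```

Under these assumptions, 5 shards per index gives about 250 primaries per node and 1 shard per index gives about 50, which is why the shard count per index matters more than the number of indices.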

The chapter about capacity planning from the book might be helpful too:

--
Adrien Grand
