How many shards for 60GB daily index?

I am trying to configure an ELK stack and I am getting timeouts when running queries from Kibana.
After researching for a while I couldn't find a good answer to my main question: how many shards should I use?

We use ELK to collect production logs.

The description of the stack:

  • All the ELK stack runs using docker on a single server.
  • Elasticsearch has only 1 node, with 8GB of the 16GB of RAM given to the JVM heap (-Xmx8g -Xms8g)
  • Logstash creates a daily index
  • For now we don't really care to lose the logs so we don't have any replicas
  • We use the default value of shards per index (5) (see the sketch after this list for checking the current layout)
  • For now we only keep 3 days of logs (3 indices), which gives us ~50 million entries per index and around 150GB of total data
  • We average around 2.2 million entries hourly
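For reference, a quick way to see the current index and shard layout (e.g. to confirm the 5 primaries per daily index) is the _cat API; something along these lines, assuming the default host and port:

    # Daily indices with primary/replica counts, doc counts and on-disk size
    curl -s 'localhost:9200/_cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size'

    # The individual shards behind those indices
    curl -s 'localhost:9200/_cat/shards/logstash-*?v'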

When we query in Kibana for the last 15 minutes or 1 hour, it returns results after around 10 seconds, but when querying for 4 hours it times out after 30 seconds.
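(As an aside, that 30-second cutoff lines up with Kibana's default elasticsearch.requestTimeout; raising it in kibana.yml, roughly as below, would only hide the slowness rather than fix it, so I'd prefer to address the root cause.)

    # kibana.yml -- the request timeout towards Elasticsearch, in milliseconds
    # (the default is 30000; 60000 here is only an example value)
    elasticsearch.requestTimeout: 60000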

We've also noticed that CPU usage on the machine is almost always very high, and RAM usage is very high as well.

From what I've read, on the one hand the shard size should be less than 30GB, and on the other hand you should use as few shards per node as possible.
Given the constraint of a single node, it seems like 2 shards per index would satisfy the 30GB rule (roughly 50GB per daily index ÷ 2 ≈ 25GB per shard). With 3 retained indices (3 days) that gives us 6 shards on that node.
Q1: Will it work ok or should we reduce it to 1 shard per index?

Another thing to note: our indices are named `logstash-year.month.day` and in Kibana we created an index pattern `logstash-*`.
Q2: Does it query all the indices even when we specify a date range in Kibana? Should we also create a daily index pattern in Kibana?

Q3: We are currently trying to avoid creating multiple nodes. Is there a way to do this with a single node, or should we create more nodes? If more, how many would work, and with how many shards?

Thanks!

FYI we’ve renamed ELK to the Elastic Stack, otherwise Beats feels left out :wink:

Check your Elasticsearch logs; for such a small amount of data it shouldn't time out.

2 primaries will be fine.

No, it knows which ones it needs to query.
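Under the hood the Kibana time picker just becomes a range filter on @timestamp (the default Logstash timestamp field), and, as noted, indices that can't contain matching data don't end up doing real work. A rough sketch of what the query boils down to:

    # What the Kibana time filter amounts to: a range on @timestamp over logstash-*
    curl -s 'localhost:9200/logstash-*/_search?size=0&pretty' -H 'Content-Type: application/json' -d '
    { "query": { "range": { "@timestamp": { "gte": "now-4h", "lte": "now" } } } }'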

To do what exactly?

To handle the load I described

I will try changing the number of shards to 2 by setting the default to 2 shards and reindexing the data, and I'll see how it behaves. I will update here.
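For the record, this is roughly how I plan to do it: an index template so new dailies get 2 primaries, plus _reindex for the existing ones. The template name, the example date and the -reindexed suffix are just placeholders.

    # Default new logstash-* indices to 2 primary shards and no replicas
    # (on ES 6.x+ the key is "index_patterns": ["logstash-*"] instead of "template")
    curl -s -XPUT 'localhost:9200/_template/logstash-shards' -H 'Content-Type: application/json' -d '
    {
      "template": "logstash-*",
      "order": 1,
      "settings": {
        "index.number_of_shards": 2,
        "index.number_of_replicas": 0
      }
    }'

    # Copy an existing daily index into a new one that picks up the template
    curl -s -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
    {
      "source": { "index": "logstash-2017.10.14" },
      "dest":   { "index": "logstash-2017.10.14-reindexed" }
    }'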

A single node should be fine for this. If it's not, then I'd be using X-Pack Monitoring to look at what is happening on the node when you have these load issues.

The minimum query latency will depend on the data, queries and shard size, so I would recommend running a benchmark to determine the most appropriate shard size for your use-case as described in this video.

Having large shards is important in order not to end up with too many shards when you have a long retention period. If, however, you want to keep your retention period at 3 days and e.g. have dashboards that refresh automatically at an interval (sending the same queries with a varying time frame), you might actually be better off using hourly indices with a single primary shard each. The reason is that indices which are no longer receiving data can cache results more efficiently. Given your heap size, the number of shards generated should be manageable.
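A rough illustration of what that would look like on the Logstash side (only the index option changes from the default daily pattern; pair it with an index template that sets index.number_of_shards to 1):

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        # hourly indices instead of the default "logstash-%{+YYYY.MM.dd}"
        index => "logstash-%{+YYYY.MM.dd.HH}"
      }
    }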

But if we look at a 24-hour period, doesn't that mean that 24 shards will run the query on the same node?
As I understand from blog posts, we want to reduce the number of shards on one node that participate in a query, or am I missing something?

That is correct. If you are running the same query over and over, e.g. by refreshing a displayed dashboard, most of these indices are not changing as no new data is arriving, which means that they will be able to cache more efficiently. If you however are exploring your data and constantly changing the queries and filters, this may indeed be worse than having a single larger index.

The optimal number and size of indices will depend on your query patterns, and it is quite possible that a solution somewhere in between might be optimal.
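If you want to verify whether that caching is actually kicking in, the shard request cache stats are exposed per index; something along these lines:

    # Hit/miss/eviction counts for the shard request cache across the logstash-* indices
    curl -s 'localhost:9200/logstash-*/_stats/request_cache?pretty'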

A small update:
I have switched to 2 shards and it is still not fast enough.

There are now 6 shards in total (3 days × 2 shards per day).
Each shard is around 6GB and has ~20 million entries.

Querying 15 minutes returns after ~20 seconds
Querying 30 minutes returns after ~18 seconds
Querying 1 hour returns after ~13 seconds
Querying 4 hours returns after ~25 seconds

Querying 12 hours times out after 30 seconds.
Querying 24 hours times out after 30 seconds.

The queries were made at 1:36 PM.
Meaning:
15min/30min/1hr/4hrs/12hrs -> 2 shards ran the query
24hrs -> 4 shards ran the query

During the queries the CPU was at ~50-75%.

I will try to change the number of shards per index to 1 and see how it will affect the results.

What does disk I/O and iowait look like during querying?

I've started changing the number of shards to 1, so right now we have 2 days with 2 shards each and 1 day (not a full day yet) with 1 shard; the rows below are individual shards:

Index date    docs per shard   shard size
2017.10.16          10162776        2.8gb
2017.10.14          20753623          6gb
2017.10.14          20759543          6gb
2017.10.15          26096116        7.9gb
2017.10.15          26097486          8gb

iostat -mx 20 with no queries running, only writes:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.02    0.00    0.51    0.19    0.00   92.28

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.45    0.00   15.80     0.00     1.88   243.89     0.04    2.74    0.00    2.74   2.16   3.41
sdb               0.00     0.05    0.00    0.15     0.00     0.00    53.33     0.00    0.33    0.00    0.33   0.33   0.01
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    0.50    0.00    0.50   0.25   0.01
dm-4              0.00     0.00    0.00   16.15     0.00     1.88   238.60     0.04    2.61    0.00    2.61   2.11   3.41
dm-5              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    0.50    0.00    0.50   0.25   0.01
dm-6              0.00     0.00    0.00    0.15     0.00     0.00    32.00     0.00    0.67    0.00    0.67   0.33   0.01
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.05     0.00     0.00    64.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.97    0.00    0.46    0.06    0.00   93.51

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.65    0.00   15.30     0.00     0.39    51.78     0.01    0.65    0.00    0.65   0.59   0.91
sdb               0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    1.00    0.00    1.00   0.50   0.01
dm-0              0.00     0.00    0.00    0.15     0.00     0.00    14.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    1.00    0.00    1.00   0.50   0.01
dm-4              0.00     0.00    0.00   15.85     0.00     0.39    49.88     0.01    0.68    0.00    0.68   0.58   0.93
dm-5              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    1.00    0.00    1.00   0.50   0.01
dm-6              0.00     0.00    0.00    0.05     0.00     0.00    64.00     0.00    2.00    0.00    2.00   2.00   0.01
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.15     0.00     0.00    32.00     0.00    0.67    0.00    0.67   0.67   0.01

A 12-hour query, which now runs for ~15 seconds; iostat -mx 15 gives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          42.56    0.00    1.42    0.39    0.00   55.63

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.13    0.00   97.60     0.00     1.96    41.14     0.07    0.70    0.00    0.70   0.63   6.16
sdb               0.00     0.07    0.00    0.20     0.00     0.01    53.33     0.00    1.00    0.00    1.00   1.00   0.02
dm-0              0.00     0.00    0.00    0.40     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.27     0.00     0.01    40.00     0.00    1.50    0.00    1.50   0.75   0.02
dm-4              0.00     0.00    0.00   97.20     0.00     1.96    41.28     0.07    0.70    0.00    0.70   0.64   6.23
dm-5              0.00     0.00    0.00    0.27     0.00     0.01    40.00     0.00    1.50    0.00    1.50   0.75   0.02
dm-6              0.00     0.00    0.00    0.20     0.00     0.00    32.00     0.00    2.00    0.00    2.00   1.00   0.02
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.07     0.00     0.00    64.00     0.00    0.00    0.00    0.00   0.00   0.00

A 24-hour query, which now sometimes times out and sometimes returns within ~20-25 seconds; iostat -mx 20 gives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          46.52    0.00    1.07    0.16    0.00   52.25

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   74.70     0.00     1.44    39.51     0.04    0.55    0.00    0.55   0.48   3.55
sdb               0.00     0.05    0.00    0.15     0.00     0.00    53.33     0.00    1.67    0.00    1.67   1.67   0.03
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    2.50    0.00    2.50   1.25   0.03
dm-4              0.00     0.00    0.00   74.25     0.00     1.44    39.75     0.04    0.53    0.00    0.53   0.48   3.58
dm-5              0.00     0.00    0.00    0.20     0.00     0.00    40.00     0.00    2.50    0.00    2.50   1.25   0.03
dm-6              0.00     0.00    0.00    0.15     0.00     0.00    32.00     0.00    3.33    0.00    3.33   1.67   0.03
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.05     0.00     0.00    64.00     0.00    0.00    0.00    0.00   0.00   0.00

One thing I want to mention: the query I run has no conditions at all. When I start adding conditions, the queries return much quicker, even across all 3 days.
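To illustrate, the slow case is essentially just the time range (so it behaves like a match_all over the window), while the fast case has a real condition that shrinks the document set before any aggregations run; roughly like this (the level field is only an example, not our real mapping):

    # Slow: only the time range, effectively matching everything in the window
    curl -s 'localhost:9200/logstash-*/_search?size=0' -H 'Content-Type: application/json' -d '
    { "query": { "bool": { "filter": [
        { "range": { "@timestamp": { "gte": "now-12h" } } }
    ] } } }'

    # Much quicker: an added condition narrows the matching docs first
    curl -s 'localhost:9200/logstash-*/_search?size=0' -H 'Content-Type: application/json' -d '
    { "query": { "bool": { "filter": [
        { "range": { "@timestamp": { "gte": "now-12h" } } },
        { "term":  { "level": "ERROR" } }
    ] } } }'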
When viewing a dashboard with 10 visualizations, it still behaves poorly:

12 hours of the dashboard, returning after ~10 seconds; iostat -mx 10:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          71.60    0.00    0.79    0.05    0.00   27.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.10    0.00   75.30     0.00     1.65    44.83     0.05    0.66    0.00    0.66   0.59   4.42
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00   74.80     0.00     1.65    45.13     0.05    0.65    0.00    0.65   0.59   4.42
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

24 hours of the dashboard, which times out after 30 seconds; iostat -mx 20:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          95.65    0.00    2.73    0.08    0.00    1.55

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.05     3.25    0.30   41.55     0.20     9.35   467.49     0.69   16.46  128.83   15.65   7.22  30.22
sdb               0.00     0.05    0.00    0.25     0.00     0.01    45.00     0.00    4.60    0.00    4.60   4.40   0.11
dm-0              0.00     0.00    0.00    0.10     0.00     0.00     5.50     0.00   27.50    0.00   27.50  27.50   0.28
dm-1              0.00     0.00    0.00    2.05     0.00     0.01     8.00     0.03   16.41    0.00   16.41   1.37   0.28
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.30     0.00     0.01    37.50     0.00    5.33    0.00    5.33   3.67   0.11
dm-4              0.00     0.00    0.25   40.35     0.20     9.35   481.47     0.68   16.71  136.20   15.97   7.45  30.25
dm-5              0.00     0.00    0.00    0.30     0.00     0.01    37.50     0.00    5.33    0.00    5.33   3.67   0.11
dm-6              0.00     0.00    0.00    0.15     0.00     0.00    32.00     0.00    6.67    0.00    6.67   3.67   0.05
dm-7              0.00     0.00    0.00    0.10     0.00     0.00    32.50     0.00    4.00    0.00    4.00   4.00   0.04
dm-8              0.00     0.00    0.00    0.05     0.00     0.00    64.00     0.00    4.00    0.00    4.00   4.00   0.02

When running the 24-hour dashboard again after it failed, it now returns after ~20-25 seconds, probably because of caching.

One more thing: I have Logstash, Elasticsearch and Kibana deployed on the same machine in Docker containers. Could it be that Logstash is the culprit and is consuming Elasticsearch's resources?
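If Logstash does turn out to be the problem, one thing I could try (just a sketch, with made-up limits) is capping its container so it can't starve Elasticsearch on the same host:

    # Re-run Logstash with capped CPU and memory (--cpus needs Docker 1.13+);
    # pipeline/config volume mounts are omitted and the image tag is only an example.
    # The Logstash JVM heap should also be lowered to fit, e.g. via its jvm.options.
    docker run -d --name logstash \
      --cpus="2" --memory="2g" \
      docker.elastic.co/logstash/logstash:5.6.3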

We've upgraded to a machine with more RAM givin elastic 30GB of ram to play around with, it helped.
But the most significant issue I found was the way our visualizations were built.
We had many visualizations without queries, only using filters and some of the visualizations used filter of the form !(some condition) which led to matching almost all the data.
So 2 conclusions I found are:

  • Always use queries in visualizations before aggregating them
  • Try to avoid filters of the form !(some condition) unless you also add a query that significantly reduces the number of matching logs (see the sketch below)
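For the second point, the difference looks roughly like this in query DSL terms (the status and message fields are only examples, not our real mapping): a bare negation still matches nearly everything, while pairing it with a positive query shrinks the candidate set first.

    # Bad: a bare negation -- still matches almost every log line
    curl -s 'localhost:9200/logstash-*/_search?size=0' -H 'Content-Type: application/json' -d '
    { "query": { "bool": {
        "must_not": [ { "term": { "status": "OK" } } ]
    } } }'

    # Better: a positive query first, so the negation only trims that subset
    curl -s 'localhost:9200/logstash-*/_search?size=0' -H 'Content-Type: application/json' -d '
    { "query": { "bool": {
        "must":     [ { "match": { "message": "checkout error" } } ],
        "must_not": [ { "term":  { "status": "OK" } } ]
    } } }'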
