Low index troughput during simultaneous indices creation


(Alan Santos) #1

Hi all,

My scenario:

Kafka-input 5.1.11
LS 5.6
ES 2.2
Number of topics: 110 (and increasing)
Number of shard per index (original data): 2
Number of shard per index (modifed data): 1

The LS consumes data from various topics using the topics_pattern. Each topic will result two indices because out login at the LS filter. We basically clone the event and add on ES the original event and the modifed event (with some field less). The name of the indices follow the pattern and use topic name

index => "[@metadata][kafka][topic]-originalData-%{+YYYYMMdd}"

and the other output

index => "[@metadata][kafka][topic]-modifedData-%{+YYYYMMdd}"

Everything work's well (usually we are near realtime logs), but when we start a new day, the troughput goes down to 0. I think that the task of various index creation at the same time is a hard thing to do, but it's annoying because the cluster come back index process only after ~ 30 minutes. I can see that on my kafka graphs:

There are some alternative to avoid this? Any options or workarounds?

Regards,


(Christian Dahlqvist) #2

How many indices and shards do you have in the cluster? What is the average shard size? How many nodes do you have in the cluster? What is the specification of each node? Why are you creating a separate index per topic and day?

Given the indexing rate in the graph you showed, I suspect that you may have far too many small indices and shards. Have a look at this blog post about shards and sharing guidelines.


(Alan Santos) #3

Number of shards: ~37167
Number of indices: ~14520
Number of data nodes: 12
Number of clientes nodes: 3
Number of master nodes: 3
Instance type data nodes: i3.2xlarge ( 8 CPU, 61 GB memory and 1900 GB NVMe SSD storage)
Shard average size: I really don't know. I can calculate it now, but it's depends on customer traffic, so, tomorrow we will have a different value for sure. I know that shards have a huge different. (from 200 kb to 90 GB).

Each topic represents a customer. Those customers have different log retention policy (we use curator after that to cleanup indices).

This can really happen because we don't know how many data our customer will recive (those index store a kind of accesslog, but with payloads).


(Christian Dahlqvist) #4

That is far too many shards, and given the disk space per host they should be quite small as well. You will need to reduce that significantly in my opinion. That many indices and shards has potentially led to a quite large cluster state that is taking time up update and distribute. If you have problem managing the index size when indices correspond to a fixed time-period, instead use the rollover index API and target a specific ideal size rather than time-period. Adjust the maximum period an index can handle based on the total retention period. This is described in this blog post.