Kafka-input 5.1.11
LS 5.6
ES 2.2
Number of topics: 110 (and increasing)
Number of shards per index (original data): 2
Number of shards per index (modified data): 1
Number of shards: ~37167
Number of indices: ~14520
Instance type data nodes: i3.2xlarge (8 CPU, 61 GB memory and 1,900 GB NVMe SSD storage)
LS consumes data from the various topics using topics_pattern. Each topic results in two indices because of our logic in the LS filter: we basically clone each event and index both the original event and a modified event (with some fields removed) in ES. The index names follow a pattern based on the topic name:
index => "[@metadata][kafka][topic]-originalData-%{+YYYYMMdd}"
and the other output
index => "[@metadata][kafka][topic]-modifedData-%{+YYYYMMdd}"
I have seen a lot of merges, so I don't know whether bulk inserts work well when the data comes from various sources (we decide the destination index at pipeline execution time). Does this make sense? Is LS smart enough to reorganize events before sending them to ES?
You have far, far too many indices and shards unless you have a very large cluster. Please read this blog post on shards and sharding practices and look to reduce that dramatically. Having a lot of small shards is very inefficient. You may also want to read this blog post.
The reason for that is we use time-based (daily) indices for each customer. Since we store the modified data for at least 4 months, that alone gives number_of_customers x number_of_days indices (1 shard each). For the original events we store one week, adding number_of_customers x 7 indices (2 shards each).
Maybe we should keep the time-based indices but roll them monthly instead of daily (for the modified data). That could decrease our number of shards and merges a lot.
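As a rough sanity check (assuming each topic corresponds to one customer, so ~110 customers, and ~120 days of modified-data retention): daily rollover gives 110 x 120 ≈ 13,200 modified-data indices, which is in the same ballpark as the ~14,520 indices we have, while monthly rollover would cut that to 110 x 4 = 440. The only change needed would be the date pattern in the output, e.g.:

index => "%{[@metadata][kafka][topic]}-modifedData-%{+YYYYMM}"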
Just to answer my own question, as this blog post explains:
A bulk request can contain documents destined for multiple indices and shards. The first processing step is therefore to split it up based on which shards the documents need to be routed to. Once this is done, each bulk sub-request is forwarded to the data node that holds the corresponding primary shard, and it is there enqueued on that node’s bulk queue. If there is no more space available on the queue, the coordinating node will be notified that the bulk sub-request has been rejected.
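In other words, a single bulk request from LS can freely mix destination indices; the coordinating node splits it up by shard. A hypothetical bulk body illustrating this (index and type names are made up):

POST /_bulk
{ "index" : { "_index" : "customer1-modifedData-20170918", "_type" : "logs" } }
{ "message" : "event for customer1" }
{ "index" : { "_index" : "customer2-originalData-20170918", "_type" : "logs" } }
{ "message" : "event for customer2" }

The two actions end up as separate sub-requests, each routed to whichever data node holds the corresponding primary shard.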
Probably the high number of merges is more related to the huge number of shards than to the bulk inserts.