Does bulk insert works well with multiples inputs?


(Alan Santos) #1

Hi all,

My scenario:

Kafka-input 5.1.11
LS 5.6
ES 2.2
Number of topics: 110 (and increasing)
Number of shard per index (original data): 2
Number of shard per index (modifed data): 1
Number of shards: ~37167
Number of indices: ~14520
Instance type data nodes: i3.2xlarge ( 8 CPU, 61 GB memory and 1900 GB NVMe SSD storage)

The LS consumes data from various topics using the topics_pattern. Each topic will result two indices because out login at the LS filter. We basically clone the event and add on ES the original event and the modifed event (with some field less). The name of the indices follow the pattern and use topic name

index => "[@metadata][kafka][topic]-originalData-%{+YYYYMMdd}"

and the other output

index => "[@metadata][kafka][topic]-modifedData-%{+YYYYMMdd}"

I have seen a lot of merges. So, i don't know if the bulk insert work's well when we have data comming from various sources (we device the destination index on pipeline execution time). This make sense? The LS are smart enough to reorganize events before send it to ES?

There are some better way to deal with that?

Regards,


(Christian Dahlqvist) #2

You have far, far too many indices and shards unless you have a very large cluster. Please read this blog post on shards and sharding practices and look to reduce that dramatically. Having a lot of small shards is very inefficient. You may also want to read this blog post.


(Alan Santos) #3

The reason for that is we use time-based (days) indices for each customer. How we store data at least 4 months, we will have number_of_customer x number_of_days (on modifed event indice x 1 indice). For the original events, we store one week (number_of_customer x number_of_days x 2 shards).

Maybe, we should store time-based, but with month insted of days (on modifed Data). This can decrease a lot our number of shards and merges.


(Christian Dahlqvist) #4

Yes, that would be an option. Another option is to store multiple users per index and filter at the application layer.


(Alan Santos) #5

Yes. I already think about this, but store more than one customer by index can be a problem because customers have differents retention times.

Do you see other option?


(Christian Dahlqvist) #6

If retention period does not change (or does so rarely) you can always group customers into indices by retention period.


(Alan Santos) #7

Good point. However, i think that will not work for us because eace customer have you own kibana (if customer-indice hard code mapped).

Make a simple count. For modifedData, we call deacrease from 25495 shards to 212 (best case) or 283 (worst case). Sound so much better then now.

The original data will be a bigger annoying because, usually, the indices are huge.

I'll try this approch for modifedData and compair merges metrics. After that, i'll return there if new metrics.

Best,


(Alan Santos) #8

Just to answer my own question, as this blog post explain:

A bulk request can contain documents destined for multiple indices and shards. The first processing step is therefore to split it up based on which shards the documents need to be routed to. Once this is done, each bulk sub-request is forwarded to the data node that holds the corresponding primary shard, and it is there enqueued on that node’s bulk queue. If there is no more space available on the queue, the coordinating node will be notified that the bulk sub-request has been rejected.

Probably, the high number of merges is more related to huge number of shards then the bulk insert.