Kafka-input 5.1.11
LS 5.6
ES 2.2
Number of topics: 110 (and increasing)
Number of shards per index (original data): 2
Number of shards per index (modified data): 1
Number of shards: ~37167
Number of indices: ~14520
Instance type data nodes: i3.2xlarge (8 CPU, 61 GB memory and 1,900 GB NVMe SSD storage)
LS consumes data from the various topics using topics_pattern. Each topic results in two indices because of our logic in the LS filter: we basically clone each event and index both the original event and a modified event (with some fields removed) in ES. The index names follow a pattern based on the topic name:
index => "[@metadata][kafka][topic]-originalData-%{+YYYYMMdd}"
and the other output
index => "[@metadata][kafka][topic]-modifedData-%{+YYYYMMdd}"
I have seen a lot of merges, so I don't know whether bulk inserts work well when the data comes from various sources (we decide the destination index at pipeline execution time). Does this make sense? Is LS smart enough to reorganize events before sending them to ES?
You have far, far too many indices and shards unless you have a very large cluster. Please read this blog post on shards and sharding practices and look to reduce that dramatically. Having a lot of small shards is very inefficient. You may also want to read this blog post.
The reason for that is we use time-based (daily) indices for each customer. Since we store the modified data for at least 4 months, that alone gives number_of_customers x number_of_days indices (1 shard each). For the original events we store one week, adding number_of_customers x 7 indices (2 shards each).
Maybe we should keep the time-based indices but roll them monthly instead of daily (for the modified data). That could decrease our number of shards and merges a lot.
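As a rough sanity check (assuming each topic corresponds to one customer, so ~110 customers, and ~120 days of modified-data retention): daily rollover gives 110 x 120 ≈ 13,200 modified-data indices, which is in the same ballpark as the ~14,520 indices we have, while monthly rollover would cut that to 110 x 4 = 440. The only change needed would be the date pattern in the output, e.g.:

index => "%{[@metadata][kafka][topic]}-modifedData-%{+YYYYMM}"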
Just to answer my own question, as this blog post explains:
A bulk request can contain documents destined for multiple indices and shards. The first processing step is therefore to split it up based on which shards the documents need to be routed to. Once this is done, each bulk sub-request is forwarded to the data node that holds the corresponding primary shard, and it is there enqueued on that node’s bulk queue. If there is no more space available on the queue, the coordinating node will be notified that the bulk sub-request has been rejected.
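In other words, a single bulk request from LS can freely mix destination indices; the coordinating node splits it up by shard. A hypothetical bulk body illustrating this (index and type names are made up):

POST /_bulk
{ "index" : { "_index" : "customer1-modifedData-20170918", "_type" : "logs" } }
{ "message" : "event for customer1" }
{ "index" : { "_index" : "customer2-originalData-20170918", "_type" : "logs" } }
{ "message" : "event for customer2" }

The two actions end up as separate sub-requests, each routed to whichever data node holds the corresponding primary shard.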
Probably the high number of merges is more related to the huge number of shards than to the bulk inserts.