ES - Spark tuning for bulk writes

I;m using Spark 3 with ES 7.10 , ECK 1.3

I have following settings


Would like to know how can we tune properly. Whats the base criteria. I have tried to

"transient": {
"": "none"
"persistent" : {
"threadpool.bulk.type": "fixed",
"threadpool.bulk.size": 60,
"threadpool.bulk.queue_size": 3000,
"threadpool.generic.keep_alive": "5m"

But its not happening with ES 7.10. How can we measure the data written per request at elasticsearch and identify the bottleneck .

ES i3 2xlarge - 5 data nodes, 3 master nodes.

Also rejection write thread has high number,
Output from one node

"write": {
"threads": 6,
"queue": 0,
"active": 0,
"rejected": 481,
"largest": 6,
"completed": 625803

How many indices and shards are you actively indexing into?

Indexes around 4 per day, with 24 shards in each

Why so many shards? How much data do you index per day?

Around 900m per day.

How much is that in GB?

Its approx min ~850GB to max 1000GB per index per day

The best way to improve efficiency and reduce the risk of rejection problems is often to limit the number of shards you index into. Instead of having daily indices with a large number of shards I would recommend switching to using rollover and ILM. That way you can set the number of primary shards to 5 per index (same as your number of data nodes) and have rollover switch to new underlying indices based on size (often recommended to around 50GB per shard) and/or time. In your case it probably means that you would generate multiple indices per day where each cover a shorter time period. This should be more efficient. es_rejected_execution_exception: rejected execution of org.elasticsearch.ingest.IngestService$3@1f629a72 on EsThreadPoolExecutor[name = ct-es-es-data-nodes-0/write, queue capacity = 500, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4e705b38[Running, pool size = 6, active threads = 6, queued tasks = 622, completed tasks = 547409]]

Also make sure you are following these guidelines if you are not already.

yes @Christian_Dahlqvist, read this already and applied them.

@Christian_Dahlqvist, still jobs are taking time, and my data nodes cpu are less than 30% only.

Indexing tends to be disk I/O intensive, so CPU is often not the limiting factor. As you are using i3 instance I suspect you should be fine though. How many concurrent clients/connections are you using for indexing?

Yes im using i3 instance with provision IOPS .
Using currently 32 connections es_rejected_execution_exception: rejected execution of coordinating operation [coordinating_and_primary_bytes=68231502, replica_bytes=44867860, all_bytes=113099362, coordinating_operation_bytes=6528502, max_coordinating_and_primary_bytes=107374182]

Are you sending bulk requests to all data nodes, avoiding the master nodes?

I have deployed using ECK. So was using http service created by that eck deployment

kibana-kb-http ClusterIP 5601/TCP 3d21h
es-es-data-nodes ClusterIP None 27h
es-es-http ClusterIP 9200/TCP 27h
es-es-master ClusterIP None 27h

using es-es-http service which has 9200 port exposed. Master doesnt have same port exposed.

@Christian_Dahlqvist , i have following from one of the node


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.51    0.00    2.44    0.55    0.06   58.43

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              1.37         1.38        15.62     345772    3915036
nvme0n1           0.00         0.01         0.00       2138          0
xvdc                370.29       397.94     25726.29   99759173 6449231588

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.