ES - Spark tuning for bulk writes

I'm using Spark 3 with ES 7.10 and ECK 1.3.

I have the following settings:

| Setting | Value |
|---|---|
| es.batch.size.bytes | 6000000 |
| es.batch.size.entries | 10000 |
| es.batch.write.refresh | false |
| es.batch.write.retry.count | 6 |

I would like to know how we can tune this properly. What are the base criteria? I have tried applying these cluster settings:

"transient": {
"indices.store.throttle.type": "none"
},
"persistent" : {
"threadpool.bulk.type": "fixed",
"threadpool.bulk.size": 60,
"threadpool.bulk.queue_size": 3000,
"threadpool.generic.keep_alive": "5m"
}
}'

But this does not work on ES 7.10 (these settings are no longer valid there). How can we measure the data written per bulk request on the Elasticsearch side and identify the bottleneck?
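A minimal sketch of requests that can help measure this, assuming Kibana Dev Tools or curl against port 9200; the index name logs-2021.01.01 is only a placeholder for one of the actual daily indices:

# Write thread pool activity and rejections per node
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed

# Node-level indexing stats (documents indexed, indexing time, throttling)
GET _nodes/stats/indices/indexing

# Per-shard stats for one daily index, to spot unevenly loaded shards
GET logs-2021.01.01/_stats?level=shards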

ES on i3.2xlarge instances - 5 data nodes, 3 master nodes.

Also, the write thread pool shows a high rejection count. Output from one node:

"write": {
"threads": 6,
"queue": 0,
"active": 0,
"rejected": 481,
"largest": 6,
"completed": 625803
}

How many indices and shards are you actively indexing into?

Around 4 indices per day, with 24 shards in each.

Why so many shards? How much data do you index per day?

Around 900 million documents per day.

How much is that in GB?

It's approximately 850GB to 1000GB per index per day.

The best way to improve efficiency and reduce the risk of rejection problems is often to limit the number of shards you index into. Instead of having daily indices with a large number of shards, I would recommend switching to rollover and ILM. That way you can set the number of primary shards to 5 per index (the same as your number of data nodes) and have rollover switch to new underlying indices based on size (often recommended to be around 50GB per shard) and/or time. In your case it probably means you would generate multiple indices per day, each covering a shorter time period. This should be more efficient.
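For reference, a minimal sketch of such a rollover setup on 7.10; the policy, template, alias and index names (spark-logs*) are placeholders, and the thresholds follow the sizing above (5 primaries x ~50GB ≈ 250GB of primary data before rollover):

PUT _ilm/policy/spark-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "250gb",
            "max_age": "1d"
          }
        }
      }
    }
  }
}

PUT _index_template/spark-logs-template
{
  "index_patterns": ["spark-logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 5,
      "index.number_of_replicas": 1,
      "index.lifecycle.name": "spark-logs-policy",
      "index.lifecycle.rollover_alias": "spark-logs"
    }
  }
}

PUT spark-logs-000001
{
  "aliases": {
    "spark-logs": { "is_write_index": true }
  }
}

The Spark job would then write to the spark-logs alias instead of a dated index name, and ILM rolls the underlying index over once the size or age threshold is hit.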

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of org.elasticsearch.ingest.IngestService$3@1f629a72 on EsThreadPoolExecutor[name = ct-es-es-data-nodes-0/write, queue capacity = 500, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4e705b38[Running, pool size = 6, active threads = 6, queued tasks = 622, completed tasks = 547409]]

Also make sure you are following these guidelines if you are not already.

Yes @Christian_Dahlqvist, I have already read these and applied them.

@Christian_Dahlqvist, the jobs are still taking a long time, and my data nodes' CPU usage is below 30%.

Indexing tends to be disk I/O intensive, so CPU is often not the limiting factor. As you are using i3 instances I suspect you should be fine though. How many concurrent clients/connections are you using for indexing?

Yes, I'm using i3 instances with provisioned IOPS.
Currently using 32 connections.

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of coordinating operation [coordinating_and_primary_bytes=68231502, replica_bytes=44867860, all_bytes=113099362, coordinating_operation_bytes=6528502, max_coordinating_and_primary_bytes=107374182]
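That rejection comes from the indexing pressure protection added in 7.9, not from the write thread pool queue. If the default indexing_pressure.memory.limit of 10% of heap is in place, the max_coordinating_and_primary_bytes=107374182 in the message corresponds to a 1 GiB JVM heap, which would be very small for an i3.2xlarge data node, so it is worth checking the heap configured in the ECK manifest. Two requests to confirm, assuming 7.9+ (the indexing_pressure section is also included in a plain GET _nodes/stats):

# Configured heap per node
GET _cat/nodes?v&h=name,heap.max,heap.percent

# Indexing pressure counters and rejections per node
GET _nodes/stats/indexing_pressure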

Are you sending bulk requests to all data nodes, avoiding the master nodes?

@Christian_Dahlqvist
I have deployed using ECK, so I was using the HTTP service created by that ECK deployment:

kibana-kb-http ClusterIP 172.20.102.107 5601/TCP 3d21h
es-es-data-nodes ClusterIP None 27h
es-es-http ClusterIP 172.20.153.65 9200/TCP 27h
es-es-master ClusterIP None 27h

I'm using the es-es-http service, which has port 9200 exposed. The master service doesn't have the same port exposed.
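One connector-side detail worth checking here (not confirmed in this thread): by default elasticsearch-hadoop discovers the nodes behind es.nodes and sends bulk requests directly to the data nodes, which only helps if the Spark executors can actually reach the discovered pod IPs; if they cannot, es.nodes.wan.only has to be set to true, and then every bulk request funnels through the single es-es-http service address. The relevant connector settings and their defaults:

| Setting | Default | Effect |
|---|---|---|
| es.nodes.discovery | true | discover the rest of the cluster behind es.nodes |
| es.nodes.data.only | true | route bulk requests to data nodes only |
| es.nodes.wan.only | false | if true, talk only to the addresses listed in es.nodes |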

@Christian_Dahlqvist, I have the following from one of the nodes:

iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.51    0.00    2.44    0.55    0.06   58.43

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              1.37         1.38        15.62     345772    3915036
nvme0n1           0.00         0.01         0.00       2138          0
xvdc            370.29       397.94     25726.29   99759173 6449231588
