I would like to understand the relationship between Spark executors/cores and the Elasticsearch batch size, and how to tune a Spark job to get better indexing throughput.
I have around 3.5B documents stored in Parquet format that I would like to ingest into Elasticsearch, but I'm not getting an indexing rate above 20K documents per second. It occasionally spikes to 60K-70K, but it drops back down immediately; the sustained average is around 15K-25K documents indexed per second.
A bit more detail about my input:
- Around 22,000 files in Parquet format
- They contain around 3.2B records (around 3 TB in total)
- Currently running 18 executors (3 executors per node)
Details about my current ES setup:
- 8 nodes, 1 master and 7 data nodes
- Instance type: c4.8xlarge
- Index with 70 shards
- Index contains 49 fields (none of them analyzed)
- No replication
- "indices.store.throttle.type" : "none"
- "refresh_interval" : "-1"
- es.batch.size.bytes: 100M (I also tried 200M)
- es.batch.size.entries: 10000 (I also tried other values)
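The last two are elasticsearch-hadoop connector settings that I pass to the Spark job. Roughly, the write path looks like this (the ES host, S3 path, and index name are placeholders; the real job has more options):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-to-es")
  // elasticsearch-hadoop connector settings (values mirror the list above)
  .config("es.nodes", "es-data-node-1")        // placeholder host
  .config("es.port", "9200")
  .config("es.batch.size.bytes", "100mb")      // the 100M setting above
  .config("es.batch.size.entries", "10000")
  .getOrCreate()

// ~22,000 Parquet files, ~3.2B records
val df = spark.read.parquet("s3://my-bucket/parquet/")   // placeholder path

df.write
  .format("org.elasticsearch.spark.sql")
  .mode("append")
  .save("my_index/doc")                        // placeholder index/type
```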
I have tried different partition counts in combination with different numbers of executors and cores, but didn't see a meaningful performance gain.
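As I understand it, each Spark task that writes to ES buffers its own bulk requests, so the number of concurrent bulk requests is roughly executors × cores per executor (provided there are at least that many partitions). This is the knob I've mainly been varying, along the lines of the sketch below (the core count is hypothetical):

```scala
// Assuming the `df` and connector settings from the snippet above.
// 18 executors × 5 cores (hypothetical) ≈ 90 concurrently writing tasks,
// each buffering and flushing its own bulk requests to the 70-shard index.
val writeParallelism = 18 * 5
df.repartition(writeParallelism)
  .write
  .format("org.elasticsearch.spark.sql")
  .mode("append")
  .save("my_index/doc")   // placeholder index/type
```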
I'm very new to Elasticsearch, so I'm not sure how to tune my Spark job for better performance. Any guidance would be highly appreciated.