I am sending 30 GB of data from Databricks to Elasticsearch through the elasticsearch-hadoop connector, and it is taking around 2 hours. How can I make it faster? How much time should it ideally take?

My Elasticsearch connector config is:

.option("es.write.operation.parallelism", "4")
      .option("es.batch.size.bytes", "10mb")
      .option("es.batch.size.entries", "1000")
      .option("es.batch.write.retry.count", "5")
      .option("es.batch.write.retry.wait", "10s")

My Elasticsearch cluster has 3 nodes:

1. Machine 1 (master and data node): 8-core CPU, 32 GB RAM, 128 GB secondary storage
2. Machine 2 (data node): 4-core CPU, 16 GB RAM, 64 GB secondary storage
3. Machine 3 (data node): 4-core CPU, 16 GB RAM, 64 GB secondary storage

Which version of Elasticsearch are you using?

What is the average size of your documents?

How many indices and shards are you actively indexing into?

Are you sending indexing requests to all nodes in the cluster?

What type of storage do you have? SSDs?

Version = 7.17
Average size of documents = 1.74 KB
Only one index, which has 3 shards and 0 replicas.
Sending requests to the master node only.
SSD.

You should send requests to all data nodes. The master does nothing special for request processing. Have you tried increasing the level of parallelism?

I am sending data to only one IP, which is the master node (es_host).
Data is being written to all 3 nodes through the elasticsearch.yml config.
Beyond a certain level of parallelism, some Spark tasks fail and the whole operation fails (tried parallelism 8 and 16).
How do I send data to all data nodes? For example, I have 3 es_host IPs.

Unfortunately, as you've discovered, there's no single best configuration for writing from Spark to Elasticsearch. It's a careful balance of sending as much data as Elasticsearch can handle without overwhelming it.

How do I send data to all data nodes? For example, I have 3 es_host IPs.

You can put a comma-delimited list of nodes in the es.nodes setting.
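
For example, a minimal sketch of the write; the IPs, the df DataFrame, and the index name here are placeholders for illustration:

// Hypothetical data-node addresses; with a comma-delimited es.nodes list,
// the connector spreads bulk requests across all of the listed hosts.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3")
  .option("es.port", "9200")
  .option("es.batch.size.bytes", "10mb")
  .option("es.batch.size.entries", "1000")
  .option("es.batch.write.retry.count", "5")
  .option("es.batch.write.retry.wait", "10s")
  .mode("append")
  .save("index-name")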

.option("es.write.operation.parallelism", "4")

I can't find any reference to that setting in the code. Did you mean to put something else here? I think you can probably remove this.

How many Spark executors are you writing from? I assume "tried parallelism 8 and 16" means you've tried writing from 8 and 16 executors? Given that your whole cluster only has 3 shards, you're probably not going to get much benefit from using that many executors. Could you try writing to an index with more shards? How many failures are you getting?

My general advice would be to increase the number of shards somewhat. Then start writing from Spark to all Elasticsearch nodes, with the number of executors equal to the number of shards you have. Then slowly increase the number of executors until Elasticsearch starts rejecting bulk requests, back off a little, and make sure you're not getting any rejections.
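
As a sketch of that starting point, assuming a 30-shard target index; repartitioning is one way to pin down write parallelism, since each partition is written by one task:

// Assumed 30-shard index: start with one Spark write task per shard,
// then raise the partition count gradually and back off as soon as
// Elasticsearch begins rejecting bulk requests.
df.repartition(30)
  .write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3")  // all data nodes
  .mode("append")
  .save("index-name")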

Increased the number of shards to 30 in the index.
Used the following API to create the index.

PUT /index-name?pretty
{
  "settings": {
    "index": {
      "number_of_shards": 30,
      "number_of_replicas": 0,
      "refresh_interval": "-1"
    }
  }
}

It is still taking the same amount of time; there is no significant improvement.
Are there any other configs?
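
Worth noting about the settings above: with refresh_interval at -1, documents won't become searchable until a refresh, so a typical follow-up once the bulk load finishes is to restore the interval (the 1s value below is just the default, shown as an example):

PUT /index-name/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}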
