I am sending 30 GB of data from Databricks to Elasticsearch through the elasticsearch-hadoop connector, and it is taking around 2 hours. How can I make it faster? How much time should it ideally take?

My Elasticsearch connector config is:

.option("es.write.operation.parallelism", "4")
      .option("es.batch.size.bytes", "10mb")
      .option("es.batch.size.entries", "1000")
      .option("es.batch.write.retry.count", "5")
      .option("es.batch.write.retry.wait", "10s")

My Elasticsearch cluster has 3 nodes:

1. Machine 1 (master and data node): 8-core CPU, 32 GB RAM, 128 GB secondary storage
2. Machine 2 (data node): 4-core CPU, 16 GB RAM, 64 GB secondary storage
3. Machine 3 (data node): 4-core CPU, 16 GB RAM, 64 GB secondary storage

Which version of Elasticsearch are you using?

What is the average size of your documents?

How many indices and shards are you actively indexing into?

Are you sending indexing requests to all nodes in the cluster?

What type of storage do you have? SSDs?

Version = 7.17
Average size of documents = 1.74 KB
Only one index, which has 3 shards and 0 replicas.
Sending requests to the master node only.
SSD.

You should send requests to all data nodes. The master does nothing special for request processing. Have you tried increasing the level of parallelism?

I am sending data to only one IP, which is the master node (es_host).
Data is being written to all 3 nodes through the elasticsearch.yml config.
Beyond a certain level of parallelism, some Spark tasks fail and the whole operation fails (tried parallelism 8 and 16).
How do I send data to all data nodes? For example, I have 3 es_host IPs.

Unfortunately, as you've discovered, there's no single best configuration for writing from Spark to Elasticsearch. It's a careful balance of sending as much data as Elasticsearch can handle without overwhelming it.

How do I send data to all data nodes? For example, I have 3 es_host IPs.

You can put a comma-delimited list of nodes in the es.nodes setting.
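
For example, a minimal sketch of the write; the IPs, the df DataFrame, and the index name here are placeholders for illustration:

// Hypothetical data-node addresses; with a comma-delimited es.nodes list,
// the connector spreads bulk requests across all of the listed hosts.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3")
  .option("es.port", "9200")
  .option("es.batch.size.bytes", "10mb")
  .option("es.batch.size.entries", "1000")
  .option("es.batch.write.retry.count", "5")
  .option("es.batch.write.retry.wait", "10s")
  .mode("append")
  .save("index-name")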

.option("es.write.operation.parallelism", "4")

I can't find any reference to that setting in the code. Did you mean to put something else here? I think you can probably remove this.

How many Spark executors are you writing from? I assume "tried parallelism 8 and 16" means you've tried writing from 8 and 16 executors? Given that your whole cluster only has 3 shards, you're probably not going to get much benefit from using that many executors. Could you try writing to an index with more shards? How many failures are you getting?

My general advice would be to increase the number of shards somewhat. Then start writing from Spark to all Elasticsearch nodes, with the number of executors equal to the number of shards you have. Then slowly increase the number of executors until Elasticsearch starts rejecting bulk requests, back off a little, and make sure you're not getting any rejections.
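
As a sketch of that starting point, assuming a 30-shard target index; repartitioning is one way to pin down write parallelism, since each partition is written by one task:

// Assumed 30-shard index: start with one Spark write task per shard,
// then raise the partition count gradually and back off as soon as
// Elasticsearch begins rejecting bulk requests.
df.repartition(30)
  .write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "10.0.0.1,10.0.0.2,10.0.0.3")  // all data nodes
  .mode("append")
  .save("index-name")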

Increased the number of shards to 30 in the index.
Used the following API to create the index.

PUT /index-name?pretty
{
  "settings": {
    "index": {
      "number_of_shards": 30,
      "number_of_replicas": 0,
      "refresh_interval": "-1"
    }
  }
}

It is still taking the same amount of time; there is no significant improvement.
Are there any other configs?
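
Worth noting about the settings above: with refresh_interval at -1, documents won't become searchable until a refresh, so a typical follow-up once the bulk load finishes is to restore the interval (the 1s value below is just the default, shown as an example):

PUT /index-name/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}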
