Spark uses one ES node at a time to write to elastic search

(Saravanan Balakrishnan) #1

I am having few clarifications with regarding Spark and ElasticSearch.

ES versions used for this approach is 5.4
Config Details:
8 ES nodes with 32 shards.
12 Spark Nodes with 4 threads each.
-- spark.executor.instances=12 -- spark.executor.cores=4 -- executor-memory 8g
Issue and Question:
Reading from ES and write to S3 object file

  1. When reading data from ES and write to S3 as object files. it writes only 32 files. Is that possible to config or set parameters to write more than 32 files. like, write the small number of multiple files?
    if so how do to do that.Reading from s3 obj file and write to ES

  2. When Spark read these 32 obj files and write to ES with 8 nodes, it uses only one node at a time to load a file and rest of the 7 nodes are doing nothing. How to make sure Spark uses or distributes load across all nodes.

one thing I did and which helped me is I repartitioned the RDD with 20K( pairRDD.repartition(20000)) and then spark can able to use all the 8 nodes to write to ES index.

Can someone please clarify is this the way Spark works or am I missing something or understood incorrectly.
Note this issue has not occurred before , and I used the same code.
Few differences are,

ES version previously 1.7 and not 5.4
No nested mapping before and now it has nested mapping
no niofs and now we have niofs
previously when i write to s3 it generates 56GB of 17 files and now its only 29 GB of same file #s and some extra 95B files which has "SEQ�!"������vcö≈p$$ãÆ�c»øoüŒ". this content in it.
Thanks in advance.

(James Baiera) #2

Since you have only 32 shards to read from, ES-Hadoop will create one Spark partition per shard when reading. This means that if each partition writes out to a single output file - you'll have 32 files in your output directory.

As for the single file reading, we've seen this in the past, but haven't been able to narrow down what causes it. During write operations ES-Hadoop just executes the current job with a function that sends the data received to Elasticsearch. Is the job spinning up multiple tasks and only running one at a time sequentially, or is it standing up one big task and running it on a single node?

The re-partition operation fixing the problem makes me think that this is a Spark issue, since no ES-Hadoop operations should have occurred between reading the data from S3 and doing the repartition operation.

(Saravanan Balakrishnan) #3

Thank you, James, for your reply. It's spinning up multiple tasks and only running one at a time.

(James Baiera) #4

Could you share your job configurations as well as the code from your spark driver?

(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.