Spark uses one ES node at a time to write to Elasticsearch

I have a few questions regarding Spark and Elasticsearch.

The ES version used for this setup is 5.4.
Config details:
8 ES nodes with 32 shards.
12 Spark nodes with 4 threads each.
--conf spark.executor.instances=12 --conf spark.executor.cores=4 --executor-memory 8g
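
For reference, the same resources expressed as a SparkConf in a Scala driver (only a sketch; the app name and ES endpoint below are placeholders, not from my actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("es-s3-transfer")           // placeholder app name
  .set("spark.executor.instances", "12")  // one executor per Spark node
  .set("spark.executor.cores", "4")       // 4 concurrent tasks per executor
  .set("spark.executor.memory", "8g")
  .set("es.nodes", "es-node-1:9200")      // placeholder ES endpoint
val sc = new SparkContext(conf)
```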
Issues and questions:

Reading from ES and writing to S3 object files

  1. When reading data from ES and writing it to S3 as object files, only 32 files are written. Is there a config setting or parameter that makes it write more than 32 files, i.e. a larger number of smaller files? If so, how do I do that?

Reading from S3 object files and writing to ES

  2. When Spark reads these 32 object files and writes to ES with 8 nodes, it uses only one node at a time to load a file while the other 7 nodes do nothing. How can I make sure Spark distributes the load across all nodes? (A sketch of both flows follows this list.)
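
For context, here is a minimal sketch of the two flows described above, assuming the ES-Hadoop Scala API (esRDD / saveToEsWithMeta); the index, type, and bucket names are placeholders, not taken from my actual job:

```scala
import org.apache.spark.SparkContext
import org.elasticsearch.spark._   // adds esRDD / saveToEsWithMeta to SparkContext / RDD

def exportToS3(sc: SparkContext): Unit = {
  // Flow 1: ES -> S3. ES-Hadoop creates one Spark partition per shard
  // (32 here), and each partition becomes one part file, hence 32 files.
  val pairRDD = sc.esRDD("myindex/mytype")            // (docId, fields) pairs
  pairRDD.saveAsObjectFile("s3a://my-bucket/export")  // placeholder path
}

def restoreToEs(sc: SparkContext): Unit = {
  // Flow 2: S3 -> ES. Read the object files back and index them again,
  // reusing the original document IDs as the write metadata.
  val restored =
    sc.objectFile[(String, scala.collection.Map[String, AnyRef])]("s3a://my-bucket/export")
  restored.saveToEsWithMeta("myindex/mytype")
}
```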

One thing I did that helped: I repartitioned the RDD to 20K partitions (pairRDD.repartition(20000)), and after that Spark was able to use all 8 nodes to write to the ES index.
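
A minimal sketch of that workaround (the index/type name is a placeholder; 20000 is simply the value I used, and a much smaller multiple of the total executor cores would probably be enough):

```scala
import org.elasticsearch.spark._

// Break the 32 input partitions into many smaller ones so the write tasks
// can be scheduled across all executors, and therefore all 8 ES nodes.
val spread = pairRDD.repartition(20000)    // value from my test run
spread.saveToEsWithMeta("myindex/mytype")  // placeholder index/type
```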

Can someone please clarify whether this is the way Spark works, or whether I am missing something or have understood it incorrectly?
Note that this issue has not occurred before, and I used the same code.
A few differences are:

The ES version was previously 1.7, not 5.4.
There was no nested mapping before; now the index has nested mappings.
We did not use niofs before; now we do.
Previously, writing to S3 generated 56 GB across 17 files; now it is only 29 GB in the same number of files, plus some extra 95-byte files whose only content appears to be a Hadoop SequenceFile header ("SEQ ... org.apache.hadoop.io.NullWritable ... org.apache.hadoop.io.BytesWritable" followed by binary bytes).
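
Incidentally, those 95-byte files look like empty part files: saveAsObjectFile writes Hadoop SequenceFiles of NullWritable/BytesWritable, and an empty partition still produces a file containing only the header. A small sketch (just a diagnostic idea, using the RDD from above) for checking how many partitions are empty before writing:

```scala
// Count records per partition; an empty partition still emits a
// header-only (~95 byte) part file when saved as an object file.
val counts = pairRDD
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()

counts.filter(_._2 == 0).foreach { case (idx, _) =>
  println(s"partition $idx is empty and will emit a header-only file")
}
```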
Thanks in advance.

Since you have only 32 shards to read from, ES-Hadoop will create one Spark partition per shard when reading. This means that if each partition writes out to a single output file - you'll have 32 files in your output directory.
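
If the goal is more, smaller files, one option (just a sketch, and not specific to ES-Hadoop) is to repartition between the ES read and the S3 write; the index, path, and target count of 128 below are only illustrative:

```scala
import org.elasticsearch.spark._

// 32 partitions come from the 32 shards; raising the partition count before
// the write splits the same data across more, smaller object files.
sc.esRDD("myindex/mytype")                     // placeholder index/type
  .repartition(128)                            // illustrative target file count
  .saveAsObjectFile("s3a://my-bucket/export")  // placeholder path
```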

As for the single file reading, we've seen this in the past, but haven't been able to narrow down what causes it. During write operations ES-Hadoop just executes the current job with a function that sends the data received to Elasticsearch. Is the job spinning up multiple tasks and only running one at a time sequentially, or is it standing up one big task and running it on a single node?

The re-partition operation fixing the problem makes me think that this is a Spark issue, since no ES-Hadoop operations should have occurred between reading the data from S3 and doing the repartition operation.

Thank you, James, for your reply. It's spinning up multiple tasks and only running one at a time.

Could you share your job configurations as well as the code from your spark driver?
