I have a few questions regarding Spark and Elasticsearch.
The ES version used for this approach is 5.4.
Config Details:
8 ES nodes with 32 shards.
12 Spark Nodes with 4 threads each.
--conf spark.executor.instances=12 --conf spark.executor.cores=4 --executor-memory 8g
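For clarity, those executor settings expressed programmatically would look roughly like the following (the app name is a placeholder; in practice we pass the values on the spark-submit command line):

    import org.apache.spark.SparkConf

    // Executor sizing expressed as SparkConf settings (a sketch; the values
    // mirror the command-line flags above).
    val conf = new SparkConf()
      .setAppName("es-s3-job")                  // placeholder app name
      .set("spark.executor.instances", "12")    // 12 executors
      .set("spark.executor.cores", "4")         // 4 cores (threads) per executor
      .set("spark.executor.memory", "8g")       // 8 GB per executor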
Issue and Question:
Reading from ES and writing to S3 as object files -
When reading data from ES and writing it to S3 as object files, the job writes only 32 files. Is it possible to configure or set parameters so that it writes more than 32 files, i.e., a larger number of smaller files? If so, how do I do that? (Something like the sketch below is what I have in mind.)
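For context, the read-and-dump side of the job is roughly the sketch below (hosts, index/type, and bucket are placeholders; es.input.max.docs.per.partition is a setting I found in the ES-Hadoop 5.x configuration docs and have not fully verified):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._   // brings in sc.esRDD and rdd.saveToEs

    object EsToS3 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("es-to-s3")
          .set("es.nodes", "es-host-1,es-host-2")           // placeholder ES hosts
          // Assumption: in ES-Hadoop 5.x this setting splits a shard into several
          // input partitions (via sliced scrolls) once it holds more documents
          // than the limit, so the read is no longer capped at one partition
          // per shard (32 in this setup).
          .set("es.input.max.docs.per.partition", "100000")
        val sc = new SparkContext(conf)

        // esRDD yields (documentId, fieldMap) pairs, one partition per input split.
        val docs = sc.esRDD("my_index/my_type")              // placeholder index/type

        // Explicit repartitioning also works: saveAsObjectFile writes one
        // SequenceFile per partition, so 256 partitions -> 256 output files.
        docs.repartition(256)
            .saveAsObjectFile("s3a://my-bucket/es-dump")     // placeholder bucket/prefix
      }
    }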
Reading from S3 object files and writing to ES -
When Spark reads these 32 object files and writes to ES with 8 nodes, it uses only one node at a time to load a file and the remaining 7 nodes do nothing. How do I make sure Spark distributes the load across all nodes?
One thing I did that helped: I repartitioned the RDD to 20,000 partitions (pairRDD.repartition(20000)), and after that Spark was able to use all 8 nodes to write to the ES index (see the sketch below).
Can someone please clarify whether this is simply how Spark works, or whether I am missing something or have understood it incorrectly?
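With the repartition in place, the write side looks roughly like the sketch below (hosts, index/type, and path are placeholders; saveToEsWithMeta assumes the saved pairs are (documentId, fieldMap), matching what esRDD produced):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._   // brings in rdd.saveToEs / saveToEsWithMeta

    object S3ToEs {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("s3-to-es")
          .set("es.nodes", "es-host-1,es-host-2")   // placeholder ES hosts
        val sc = new SparkContext(conf)

        // The 32 object files come back as (at most) 32 partitions, so only a
        // handful of tasks (and therefore ES nodes) are busy at any time.
        // Repartitioning creates many small bulk-indexing tasks that spread
        // across all executors and, through them, across all ES data nodes.
        // The element type must match what was written by saveAsObjectFile.
        val pairs = sc.objectFile[(String, Map[String, AnyRef])]("s3a://my-bucket/es-dump") // placeholder path

        pairs.repartition(200)                     // a few hundred may be enough; 20000 is likely overkill
             .saveToEsWithMeta("my_index/my_type") // assumes the pair key is the document id
      }
    }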
Note that this issue did not occur before, and I am using the same code.
The few differences are:
The ES version was previously 1.7, not 5.4.
There was no nested mapping before; now the index has a nested mapping.
We did not use niofs before; now we do.
Previously the write to S3 produced 56 GB across 17 files; now it is only 29 GB across the same number of files, plus some extra 95-byte files whose only content is the SequenceFile header ("SEQ ... org.apache.hadoop.io.NullWritable ... org.apache.hadoop.io.BytesWritable" followed by a few binary bytes).
Thanks in advance.