Indexing data in bulk in Elasticsearch using PySpark

(Ronak Nathani) #1


I have ~1TB of data in S3 that I am transforming with PySpark, and I am trying to write the result to Elasticsearch. I came across the ES-Hadoop package; however, I had to set es.batch.write.retry.count to -1 to keep writing data successfully. The documentation says this can have potential side effects. I am curious as to what those are. Any clarification would be appreciated.
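For reference, here is a minimal sketch of the kind of ES-Hadoop write configuration in question. It assumes the elasticsearch-hadoop jar is on the Spark classpath and that the data is a DataFrame. The helper names `es_write_options` and `write_dataframe`, the host, and the index name are hypothetical, and the retry wait and batch size values are only illustrative. A negative retry count tells the connector to retry rejected batches indefinitely.

```python
def es_write_options(nodes, retry_count=3, retry_wait="10s", batch_entries=1000):
    """Build ES-Hadoop connector options (hypothetical helper).

    All values are passed to the connector as strings.
    """
    return {
        "es.nodes": nodes,
        # Retries per batch when Elasticsearch rejects data; negative = retry forever.
        "es.batch.write.retry.count": str(retry_count),
        # Pause between retries of a rejected batch.
        "es.batch.write.retry.wait": retry_wait,
        # Number of documents per bulk request issued by each task.
        "es.batch.size.entries": str(batch_entries),
    }


def write_dataframe(df, resource, options):
    """Append a Spark DataFrame to the given index (resource) through ES-Hadoop."""
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**options)
       .mode("append")
       .save(resource))


# The setting described above: unlimited retries on rejected batches.
options = es_write_options("localhost:9200", retry_count=-1)
```

With a DataFrame `df` in hand, the write itself would be `write_dataframe(df, "my-index", options)`.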

Also, I figured that another way to write data to Elasticsearch from PySpark is to use the foreachPartition function on Spark's RDD along with the Elasticsearch Python client's bulk API to iterate over the elements and index them.
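A rough sketch of that second approach, assuming each RDD element is already a plain dict and that the `elasticsearch` Python package is installed on every executor. The names `to_actions` and `index_partition`, the host URL, and the index name are hypothetical. The client is created inside the partition function because client objects cannot be pickled and shipped from the driver.

```python
def to_actions(rows, index_name):
    """Turn an iterator of dicts into bulk actions for the Python client."""
    for row in rows:
        yield {"_index": index_name, "_source": row}


def index_partition(rows, hosts, index_name, chunk_size=1000):
    """Index one partition with the client's bulk helper.

    Runs on the executor, so the import and the client live here.
    """
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(hosts)
    # helpers.bulk consumes the generator lazily and sends chunk_size docs per request.
    helpers.bulk(client, to_actions(rows, index_name), chunk_size=chunk_size)


# Usage with an RDD of dicts (assumed to exist as `rdd`):
# rdd.foreachPartition(
#     lambda rows: index_partition(rows, ["http://localhost:9200"], "my-index")
# )
```

In this version, retry and back-off behaviour is whatever the Python client provides, not what ES-Hadoop provides, so rejected documents would have to be handled in `index_partition` itself.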

Is one of these methods better than the other?

