Indexing data in bulk in Elasticsearch using PySpark

ronak · June 14, 2016, 12:49am

Hi,

I have ~1 TB of data in S3 that I transform with PySpark and then try to write to Elasticsearch. I came across the ES-Hadoop package; however, I had to set es.batch.write.retry.count to -1 to keep writing data successfully. The documentation says this can have potential side effects. I am curious as to what those are. Any clarification would be appreciated.
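For context, here is a minimal sketch of the kind of ES-Hadoop write I mean. The helper names (es_write_conf, write_rdd), the host and the index/type are placeholders I made up for illustration. The batch size and retry wait values are what I believe the connector defaults to, and -1 for the retry count means retry indefinitely.

```python
def es_write_conf(nodes, resource, retry_count=-1):
    """Build ES-Hadoop settings as a dict of strings (hypothetical helper)."""
    return {
        "es.nodes": nodes,                       # placeholder host:port
        "es.resource": resource,                 # placeholder index/type
        # A negative value asks the connector to retry rejected batches forever.
        "es.batch.write.retry.count": str(retry_count),
        "es.batch.write.retry.wait": "10s",      # pause between retries
        "es.batch.size.entries": "1000",         # docs per bulk request, per task
        "es.batch.size.bytes": "1mb",            # bytes per bulk request, per task
    }


def write_rdd(rdd, conf):
    """Write an RDD of (key, dict) pairs through ES-Hadoop's output format.

    The key is ignored by EsOutputFormat; the dict becomes the document.
    Requires the elasticsearch-hadoop jar on the Spark classpath.
    """
    rdd.saveAsNewAPIHadoopFile(
        path="-",  # unused by EsOutputFormat, but the argument is required
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )


conf = es_write_conf("localhost:9200", "my-index/my-type")
# write_rdd(transformed_rdd.map(lambda doc: ("key", doc)), conf)
```

The last line is commented out because transformed_rdd stands in for whatever RDD comes out of the transformations.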

Also, I figured that another way to write data to Elasticsearch from PySpark is to use the foreachPartition function on Spark's RDD together with the Elasticsearch Python client's bulk API to iterate over the elements and index them.
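A rough sketch of that second approach, assuming each RDD element is already a dict that can be serialized as a document. The function names, host URL and index name are hypothetical, and older Elasticsearch versions would also expect a _type field in each action.

```python
def to_actions(rows, index):
    """Turn an iterator of dicts into bulk actions for the Python client."""
    for row in rows:
        yield {"_index": index, "_source": row}


def index_partition(rows, hosts=("http://localhost:9200",), index="my-index"):
    """Index one Spark partition with the Python client's bulk helper."""
    # Imported inside the function so the import runs on the executors,
    # which need the elasticsearch package installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(list(hosts))
    # helpers.bulk consumes the generator lazily and sends it in chunks.
    helpers.bulk(client, to_actions(rows, index), chunk_size=1000)


# One client and one stream of bulk requests per partition:
# transformed_rdd.foreachPartition(index_partition)
```

With this approach, retries and back-off on rejected documents are up to the caller, whereas ES-Hadoop handles them through its es.batch.write.* settings.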

Is one of these methods better than the other?

Thanks!
