I am on a new project that is saving Spark DataFrames to Elasticsearch using
dataFrame.saveToEs(...)
There are multiple terabytes of data (more than 10) being saved to a rather small cluster (5 nodes), and it is taking on the order of a week to load. Is this method of saving data to Elasticsearch suitable for this volume of data (I am guessing not), or would it be better to first save the DataFrames to JSON (the Spark/Hadoop cluster is much, much bigger) and then bulk load?
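To sketch what that looks like, the write is essentially the following, using the elasticsearch-spark connector (host names, index name, and source path are just placeholders, not the real ones):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

// Placeholder host names, source path, and index name.
val spark = SparkSession.builder()
  .appName("es-bulk-load")
  .config("es.nodes", "es-node-1,es-node-2")
  .config("es.port", "9200")
  .getOrCreate()

val df = spark.read.parquet("/data/events")  // placeholder source
df.saveToEs("events")                        // target index (or "index/type" on older versions)
```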
The saving process already converts the data into JSON and performs successive bulk loads into Elasticsearch, so doing this outside of the connector probably won't change the write speed. Ten-plus terabytes does seem like a large amount of data for a 5-node cluster, though I'm not familiar with your deployment, so it's hard for me to say it's unreasonable.
One thing that might make sense to look at is the batch size settings in ES-Hadoop. Increasing these may allow you to send larger bulk requests and cut down on the amount of back-and-forth traffic.
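For example, something along these lines, passing the settings as write options (`df` stands in for your DataFrame, and the values are only illustrative; you'd want to tune them against what the 5-node cluster can keep up with):

```scala
import org.elasticsearch.spark.sql._

// Illustrative values only -- tune against what the Elasticsearch cluster can absorb.
val writeConf = Map(
  "es.batch.size.bytes"        -> "8mb",   // per-task bulk request size (default 1mb)
  "es.batch.size.entries"      -> "5000",  // per-task bulk document count (default 1000)
  "es.batch.write.retry.count" -> "6"      // retries when bulk requests are rejected (default 3)
)

df.saveToEs("events", writeConf)
```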
Additionally, disabling refresh on your indices and in ES-Hadoop may help the ingestion rate, since Elasticsearch will spend less time flushing small segments to disk.
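Roughly, that would look like the following, with the index-level setting changed outside of Spark (again, the index name and values are only illustrative):

```scala
import org.elasticsearch.spark.sql._

// Connector side: stop the connector from issuing a refresh after each bulk write.
val refreshConf = Map(
  "es.batch.write.refresh" -> "false"  // default is true
)
df.saveToEs("events", refreshConf)

// Index side (done outside Spark, before the load starts):
//   PUT events/_settings
//   { "index": { "refresh_interval": "-1" } }
// and restore it afterwards (e.g. back to "1s") so the new documents become
// searchable on the normal schedule once the load has finished.
```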