Saving dataframe to ElasticSearch

I am on a new project that is saving Spark dataframes to Elasticsearch using

dataFrame.saveToEs(...)

There are multiple terabytes of data (> 10) being saved to a rather small cluster (5 nodes), and it is taking on the order of a week to load. Is this method of saving data to Elasticsearch suitable for this volume of data (I am guessing not), or would it be better to save the dataframes to JSON first (the Spark/Hadoop cluster is much, much bigger) and then bulk load?

The saving process already converts the data to JSON and performs successive bulk loads into Elasticsearch, so doing this outside of the connector probably won't change write speed. Ten-plus terabytes does seem like a large amount of data for a 5 node cluster, though I'm not familiar with your deployment, so it's hard for me to say it's unreasonable.

One thing that might be worth a look is the batch size settings in ES-Hadoop. Increasing these may allow you to send larger bulk requests and cut down on round-trip traffic.
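As a sketch of what that might look like from Spark: the batch sizing settings can be passed as options on the save call. The index name is hypothetical, and the values shown are only starting points to tune against your cluster, not recommendations.

```scala
// Sketch, assuming the elasticsearch-spark connector is on the classpath.
import org.elasticsearch.spark.sql._

dataFrame.saveToEs(
  "myindex",                            // hypothetical target index
  Map(
    "es.batch.size.bytes"   -> "4mb",   // size of a bulk request (default 1mb)
    "es.batch.size.entries" -> "4000"   // docs per bulk request (default 1000)
  )
)
```

Larger batches mean fewer, bigger bulk requests; watch for bulk rejections on the Elasticsearch side as you increase them.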

Additionally, disabling refresh on your indices and in ES-Hadoop may help the ingestion rate, since Elasticsearch will spend less time flushing small segments to disk.
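Concretely, that is two settings: one on the index itself and one on the connector. A sketch (the index name is hypothetical; remember to restore `refresh_interval` once the load finishes):

```
# Disable periodic refresh on the target index before the load:
PUT /myindex/_settings
{
  "index": { "refresh_interval": "-1" }
}

# And tell ES-Hadoop not to invoke a refresh after each bulk write:
es.batch.write.refresh = false

# After the load, re-enable refresh (default is 1s):
PUT /myindex/_settings
{
  "index": { "refresh_interval": "1s" }
}
```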

James, I suspected the Elasticsearch cluster was a bit on the small side. Thanks for your response.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.