I am on a new project that is saving Spark DataFrames to Elasticsearch using
dataFrame.saveToEs(...)
There are multiple terabytes of data (more than 10) being saved to a rather small cluster (5 nodes), and it is taking on the order of a week to load. Is this method of saving data to Elasticsearch suitable for this volume of data (I am guessing not), or would it be better to first save the DataFrames to JSON (the Spark/Hadoop cluster is much, much bigger) and then bulk load?
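To sketch what that looks like, the write is essentially the following, using the elasticsearch-spark connector (host names, index name, and source path are just placeholders, not the real ones):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

// Placeholder host names, source path, and index name.
val spark = SparkSession.builder()
  .appName("es-bulk-load")
  .config("es.nodes", "es-node-1,es-node-2")
  .config("es.port", "9200")
  .getOrCreate()

val df = spark.read.parquet("/data/events")  // placeholder source
df.saveToEs("events")                        // target index (or "index/type" on older versions)
```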
The saving process already converts the data into JSON and performs successive bulk loads into Elasticsearch, so doing this outside of the connector probably won't change the write speed. Ten-plus terabytes does seem like a large amount of data for a 5-node cluster, though I'm not familiar with your deployment, so it's hard for me to say it's unreasonable.
One thing that might make sense to look at is the batch size settings in ES-Hadoop. Increasing these may allow you to send larger bulk requests and cut down on the amount of back-and-forth traffic.
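For example, something along these lines, passing the settings as write options (`df` stands in for your DataFrame, and the values are only illustrative; you'd want to tune them against what the 5-node cluster can keep up with):

```scala
import org.elasticsearch.spark.sql._

// Illustrative values only -- tune against what the Elasticsearch cluster can absorb.
val writeConf = Map(
  "es.batch.size.bytes"        -> "8mb",   // per-task bulk request size (default 1mb)
  "es.batch.size.entries"      -> "5000",  // per-task bulk document count (default 1000)
  "es.batch.write.retry.count" -> "6"      // retries when bulk requests are rejected (default 3)
)

df.saveToEs("events", writeConf)
```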
Additionally, disabling refresh on your indices and in ES-Hadoop may help the ingestion rate, since Elasticsearch will spend less time flushing small segments to disk.
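Roughly, that would look like the following, with the index-level setting changed outside of Spark (again, the index name and values are only illustrative):

```scala
import org.elasticsearch.spark.sql._

// Connector side: stop the connector from issuing a refresh after each bulk write.
val refreshConf = Map(
  "es.batch.write.refresh" -> "false"  // default is true
)
df.saveToEs("events", refreshConf)

// Index side (done outside Spark, before the load starts):
//   PUT events/_settings
//   { "index": { "refresh_interval": "-1" } }
// and restore it afterwards (e.g. back to "1s") so the new documents become
// searchable on the normal schedule once the load has finished.
```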