Saving a DataFrame to Elasticsearch

The saving process already converts the data into JSON and issues successive bulk requests to Elasticsearch, so doing that conversion yourself outside of the connector probably won't change write speed. Ten-plus terabytes seems like a large amount of data for a 5-node cluster, though I'm not familiar with your deployment, so it's hard for me to say it's unreasonable.
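
For reference, here is a minimal sketch of the standard write path through the connector, assuming Spark SQL with elasticsearch-spark on the classpath; the node address, index name, and source path are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-write").getOrCreate()

// Hypothetical source; any DataFrame is written the same way.
val df = spark.read.parquet("/data/events")

// The connector serializes each row to JSON and bulk-loads it;
// nothing extra is needed on your side.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node-1:9200") // placeholder node address
  .mode("append")
  .save("events")                       // placeholder index name
```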

One thing that might be worth a look is the batch size settings in ES-Hadoop (`es.batch.size.bytes` and `es.batch.size.entries`). Increasing these may let you send larger bulk requests and cut down on back-and-forth traffic.
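
A sketch of raising those knobs via `DataFrameWriter` options; the values shown are illustrative starting points rather than tuned recommendations, and the node address and index name are placeholders:

```scala
// Note that the batch settings apply per writing task, so the
// effective bulk load on the cluster scales with parallelism.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node-1:9200")    // placeholder node address
  .option("es.batch.size.bytes", "4mb")    // default is 1mb
  .option("es.batch.size.entries", "5000") // default is 1000 docs
  .mode("append")
  .save("events")                          // placeholder index name
```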

Additionally, disabling refresh on your indices, and in ES-Hadoop via `es.batch.write.refresh`, may help the ingestion rate, since Elasticsearch will spend less time flushing small segments to disk.
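
A sketch of turning off the connector's per-bulk refresh; the index-side `refresh_interval` change goes through the index settings API and is shown here only as a comment. Again, the node address and index name are placeholders:

```scala
// Index-side, the refresh interval is relaxed separately, e.g.:
//   PUT /events/_settings  {"index": {"refresh_interval": "-1"}}
// and restored (e.g. to "1s") once the load completes, so the
// data becomes searchable again.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-node-1:9200")      // placeholder node address
  .option("es.batch.write.refresh", "false") // default is true
  .mode("append")
  .save("events")                            // placeholder index name
```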