Elasticsearch bails out with ES-Hadoop plugin

We are using the ES-Hadoop plugin to push data into an Elasticsearch cluster from a Hadoop HBase table. Below are the cluster details.

ELASTICSEARCH VERSION: 2.3.5
DATA NODES: 3
MASTER NODES: 3
CLIENT NODE: 1

The data nodes are master nodes as well.
DATA/MASTER NODES HEAP: 20GB
CLIENT NODES HEAP: 3GB

Number of Primary Shards per Index: 5
Number of Replica Shards per index: 1

When we execute jobs on Spark, on the stages where we push data from Hadoop to Elasticsearch, after some time we start getting 'Elasticsearch Bailing Out' errors.

We suspect that the Spark executors are exceeding the number of concurrent Bulk API connections that Elasticsearch can process.

Kindly suggest how, with the above configuration, we can determine how many concurrent Bulk API connections the Elasticsearch client node can process while successfully writing the data, and what the maximum number of documents per bulk request should be.

Also, which parameters should we look at to optimise the Elasticsearch cluster for write operations, given that we need to index 80-90 GB of data per hour?

A good place to start reading would be our docs section on performance tuning, primarily the section about tuning write operations.

A key thing to keep in mind is that Elasticsearch has a set number of bulk threads to handle requests; once those are used up, bulk requests go into a work queue while they wait for threads to become available. These queues are per node, and if the number of concurrent operations is high enough, they can fill up and reject your requests.

My advice is to set up some instrumentation around your ingestion process (my go-to is jVisualVM) and check how long the process is waiting on the network/ES versus how much work it's doing in-process. Oftentimes it's better to decrease the number of concurrent writers, since more time is spent waiting in a queue than doing any actual work. These settings are going to be different for everyone, so it really is super important to collect performance data while you tune them.
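To make that concrete, here is a minimal sketch of the knobs involved, assuming Scala with the elasticsearch-spark connector (the node address, index name, and the specific values are placeholders, not recommendations): `es.batch.size.entries` / `es.batch.size.bytes` bound the size of each bulk request, the `es.batch.write.retry.*` settings control how rejected bulks are retried, and the partition count on the write stage caps the number of concurrent writers.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Illustrative values only -- measure and tune against your own workload.
val conf = new SparkConf()
  .setAppName("hbase-to-es")
  .set("es.nodes", "client-node:9200")   // hypothetical client node address
  // Size of each bulk request issued per task (connector defaults: 1000 docs / 1mb).
  .set("es.batch.size.entries", "1000")
  .set("es.batch.size.bytes", "1mb")
  // Back off and retry when bulk requests are rejected instead of bailing out right away.
  .set("es.batch.write.retry.count", "6")
  .set("es.batch.write.retry.wait", "30s")

val sc = new SparkContext(conf)

// Each partition of the RDD being written becomes one concurrent writer,
// so repartitioning down before the write caps the number of simultaneous
// bulk requests hitting the cluster.
val docs = sc.makeRDD(Seq(Map("id" -> 1, "message" -> "example")))
docs.repartition(6).saveToEs("myindex/mytype")
```

While the job runs, something like `GET _cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected` against the cluster will show whether the per-node bulk queues are filling up and rejecting requests.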

Is there any way to change the logging config so that we can get the time spent by ES on each bulk write and, if a bulk write fails, the reason for the failure?

If you are on ES-Hadoop version 6.2.x or later, you can set the logging level to DEBUG for the org.elasticsearch.hadoop.rest.bulk package. This should report when the connector is flushing, including when it encounters errors (even errors that it decides to retry). There isn't any logging right now for when a bulk request ends, though. I'll see about adding something for that.
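As a minimal sketch (assuming a Spark driver using the stock log4j 1.x logging that ships with Spark), that package can be raised to DEBUG either in log4j.properties or programmatically:

```scala
import org.apache.log4j.{Level, Logger}

// Equivalent log4j.properties entry:
//   log4j.logger.org.elasticsearch.hadoop.rest.bulk=DEBUG
Logger.getLogger("org.elasticsearch.hadoop.rest.bulk").setLevel(Level.DEBUG)
```

Note this only affects the JVM it runs in; to see the flush logging from the executors as well, you would typically ship a custom log4j.properties with spark-submit.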

I opened https://github.com/elastic/elasticsearch-hadoop/issues/1122 to track adding that logging.

thanks!
