In an application, I fetch rows from multiple tables over separate JDBC connections in parallel, transform the rows into document fields, and send them to ES in batches of 1000 using the Java Bulk API. I set the timeout of each bulk insert to 15s and retry at most 3 times on failure. The entire process takes more than 6 hours with at most 6 fetches in parallel, against an ES cluster of 3 beefy VMs (1 data and 1 master node on each VM). But occasionally some bulk inserts fail even after the retries. How should I diagnose and tackle this problem? (For the record, the index is created with indices.store.throttle.type=none, number_of_shards=6, number_of_replicas=0, index.refresh_interval=-1, and translog.disable_flush=true. Upon successful completion, we revert these to production settings.)
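The insert path looks roughly like the sketch below (assuming the transport client; `esClient`, `my_index`, `my_type`, and the constants are placeholders, not the exact production code):

```java
import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class BulkIndexer {

    private static final int MAX_RETRIES = 3;
    private static final TimeValue BULK_TIMEOUT = TimeValue.timeValueSeconds(15);

    private final Client esClient;  // hypothetical injected transport client

    public BulkIndexer(Client esClient) {
        this.esClient = esClient;
    }

    /** Indexes one batch of up to 1000 pre-serialized JSON documents, retrying the whole batch on failure. */
    public void indexBatch(List<String> jsonDocs) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            BulkRequestBuilder bulk = esClient.prepareBulk().setTimeout(BULK_TIMEOUT);
            for (String json : jsonDocs) {
                bulk.add(esClient.prepareIndex("my_index", "my_type").setSource(json));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (!response.hasFailures()) {
                return;  // batch indexed successfully
            }
            // Log the per-item failures and retry the whole batch.
            System.err.println("Bulk attempt " + attempt + " failed: " + response.buildFailureMessage());
        }
        throw new IllegalStateException("Bulk insert failed after " + MAX_RETRIES + " attempts");
    }
}
```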
Sorry for the misunderstanding. By "fail" I mean that my wrapper Hystrix command times out after 15s. I even tried increasing the timeout threshold to 30s. Even then, after a certain number of inserts, some inserts occasionally just keep on waiting.
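For context, the wrapper is roughly the following sketch (command name, group key, and the fallback-free structure are placeholders; the 30s value corresponds to the increased threshold mentioned above, and the exact property setter may differ by Hystrix version):

```java
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

/** Hypothetical Hystrix wrapper around a single bulk insert; it times out if ES does not answer in time. */
public class BulkInsertCommand extends HystrixCommand<Void> {

    private final BulkIndexer indexer;
    private final List<String> batch;

    public BulkInsertCommand(BulkIndexer indexer, List<String> batch) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("es-bulk"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(30_000)));  // raised from 15s to 30s
        this.indexer = indexer;
        this.batch = batch;
    }

    @Override
    protected Void run() {
        indexer.indexBatch(batch);  // blocks until the bulk request returns or Hystrix interrupts it
        return null;
    }
}
```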
See the idle state in the translog size? That's where the entire ES cluster gets busy flushing the translog, which in the meantime causes delays of more than 2 minutes in our bulk inserts. Is translog.disable_flush=true not doing what I expect it to do, or am I misinterpreting its function?
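To correlate the stalls with flush activity, I poll the per-index translog stats while the load runs, roughly like the sketch below (index name is a placeholder and the exact stats getters may vary by ES version):

```java
import org.elasticsearch.action.admin.indices.stats.IndicesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.translog.TranslogStats;

public class TranslogMonitor {

    /** Prints the current translog operation count and size for the given index. */
    public static void printTranslogStats(Client esClient, String indexName) {
        IndicesStatsResponse stats = esClient.admin().indices()
                .prepareStats(indexName)
                .setTranslog(true)   // fetch only translog stats
                .get();
        TranslogStats translog = stats.getTotal().getTranslog();
        System.out.printf("translog: %d ops, %d bytes%n",
                translog.estimatedNumberOfOperations(),
                translog.getTranslogSizeInBytes());
    }
}
```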