Occasional Bulk Insert Failure (ES 2.4.4)


(Volkan Yazıcı) #1

Hi all!

In an application, I fetch rows from multiple tables over distinct JDBC connections in parallel, transform the rows into document fields, and send them to ES in batches of 1000 using the Java Bulk API. I set the timeout of the bulk inserts to 15s and retry at most 3 times on failure. The entire process takes >6 hours with at most 6 fetches in parallel, on an ES cluster of 3 beefy VMs (1 data node and 1 master node on each VM). But occasionally some bulk inserts fail even after retries. How should I diagnose and tackle this problem? (For the record, the index is created using indices.store.throttle.type=none, number_of_shards=6, number_of_replicas=0, index.refresh_interval=-1, and translog.disable_flush=true. Upon successful completion, we revert these to production settings.)

Best.
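
For readers following along, the batching and bounded retry described in this post can be sketched roughly as below. This is a minimal illustration, not the poster's code: `BulkBatcher`, `partition`, and `sendWithRetry` are hypothetical names, the `Predicate` stands in for whatever actually issues the bulk request, and "at most 3 times" is assumed to mean three attempts in total.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper; not part of the Elasticsearch client. */
final class BulkBatcher {

    private BulkBatcher() {}

    /** Splits rows into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, rows.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(rows.subList(from, to)));
        }
        return batches;
    }

    /**
     * Calls sender until it reports success or maxAttempts is exhausted.
     * The sender is a stand-in for the real bulk call and returns true on success.
     */
    static <T> boolean sendWithRetry(Predicate<List<T>> sender, List<T> batch, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sender.test(batch)) {
                return true;
            }
        }
        // The caller decides what to do with a batch that never succeeded.
        return false;
    }
}
```

With 2,500 rows and a batch size of 1000, `partition` yields batches of 1000, 1000, and 500. A batch for which `sendWithRetry` returns false is the "fails even after retries" case the post describes.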


(Alexander Reelsen) #2

Hey,

can you provide more information about how the bulk inserts failed? Did they fail because of a server or client issue? Can you provide the responses?

--Alex


(Volkan Yazıcı) #3

Hey Alex!

Sorry for the misunderstanding. By "fail", I mean that my wrapper Hystrix command times out after 15s. I even tried increasing the timeout threshold to 30s. Even then, after a certain number of inserts, occasionally some inserts just keep on waiting.

Best.
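
To make the distinction between a client-side timeout and a server-side rejection concrete, here is a rough sketch of a timeout wrapper. It deliberately uses plain `java.util.concurrent` in place of Hystrix, so it is not the poster's setup; `TimedCall`, `Outcome`, and `run` are hypothetical names, and the `Callable` stands in for the bulk request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a Hystrix-style timeout wrapper. */
final class TimedCall {

    enum Outcome { SUCCESS, FAILED, TIMED_OUT }

    private TimedCall() {}

    /** Runs call on the executor and waits at most timeout for its result. */
    static Outcome run(ExecutorService executor, Callable<Boolean> call,
                       long timeout, TimeUnit unit) {
        Future<Boolean> future = executor.submit(call);
        try {
            // A null or false result is treated as a reported failure.
            return Boolean.TRUE.equals(future.get(timeout, unit))
                    ? Outcome.SUCCESS
                    : Outcome.FAILED;
        } catch (TimeoutException e) {
            // Only the waiting side gives up here; cancel(true) interrupts
            // the worker thread running the call.
            future.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            // The call itself threw, i.e. an actual error rather than a timeout.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller and stop waiting.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Outcome.FAILED;
        }
    }
}
```

Separating `TIMED_OUT` from `FAILED` matters for diagnosis: a timeout says only that the caller stopped waiting, not that the cluster rejected the request.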


(Alexander Reelsen) #4

Hey,

have you checked your Elasticsearch logs during that time? Is there a node doing garbage collection maybe? You will find that in the logs.

--Alex
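
One way to act on this suggestion is to scan the node logs for JVM monitor entries. The sketch below is only an illustration: `GcLogScan` and `summarize` are hypothetical names, and the pattern assumes GC entries contain a fragment shaped like `[gc][old][123][4] duration [12.3s]`, which should be checked against the actual log output before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log scanner; the entry format below is an assumption. */
final class GcLogScan {

    // Assumed fragment: "[gc][<collector>][<n>][<n>] duration [<time>]".
    private static final Pattern GC_ENTRY = Pattern.compile(
            "\\[gc\\]\\[(\\w+)\\]\\[\\d+\\]\\[\\d+\\] duration \\[([^\\]]+)\\]");

    private GcLogScan() {}

    /** Returns "<collector> <duration>" for every line that matches the assumed format. */
    static List<String> summarize(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = GC_ENTRY.matcher(line);
            if (m.find()) {
                hits.add(m.group(1) + " " + m.group(2));
            }
        }
        return hits;
    }
}
```

Lining up the timestamps of any matching entries with the times at which bulk inserts stalled would show whether long collections coincide with the timeouts.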


(Volkan Yazıcı) #5

I think we found the culprit: translog flushes. Although I set translog.disable_flush=true, apparently ES still prefers to do some:

See the idle state in the translog size? That's where the entire ES cluster gets busy flushing the translog, which in the meantime causes >2m delays in our bulk inserts. Is translog.disable_flush=true not doing what I expect it to do, or am I misinterpreting its function?


(Alexander Reelsen) #6

Hey,

is there any reason you decided to disable the flushing of the translog in the first place? This option was removed (and only useful in tests anyway).

--Alex

