In an application, I fetch rows from multiple tables over separate JDBC connections in parallel, transform the rows into document fields, and send them to ES in batches of 1000 using the Java Bulk API. I set the timeout of each bulk insert to 15s and retry at most 3 times on failure. The entire process takes more than 6 hours with at most 6 fetches in parallel, against an ES cluster of 3 beefy VMs (1 data and 1 master node on each VM). But occasionally some bulk inserts fail even after the retries. How should I diagnose and tackle this problem? (For the record, the index is created with indices.store.throttle.type=none, number_of_shards=6, number_of_replicas=0, index.refresh_interval=-1, and translog.disable_flush=true. Upon successful completion, we revert these to production settings.)
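The insert path looks roughly like the sketch below (assuming the transport client; `esClient`, `my_index`, `my_type`, and the constants are placeholders, not the exact production code):

```java
import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class BulkIndexer {

    private static final int MAX_RETRIES = 3;
    private static final TimeValue BULK_TIMEOUT = TimeValue.timeValueSeconds(15);

    private final Client esClient;  // hypothetical injected transport client

    public BulkIndexer(Client esClient) {
        this.esClient = esClient;
    }

    /** Indexes one batch of up to 1000 pre-serialized JSON documents, retrying the whole batch on failure. */
    public void indexBatch(List<String> jsonDocs) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            BulkRequestBuilder bulk = esClient.prepareBulk().setTimeout(BULK_TIMEOUT);
            for (String json : jsonDocs) {
                bulk.add(esClient.prepareIndex("my_index", "my_type").setSource(json));
            }
            BulkResponse response = bulk.execute().actionGet();
            if (!response.hasFailures()) {
                return;  // batch indexed successfully
            }
            // Log the per-item failures and retry the whole batch.
            System.err.println("Bulk attempt " + attempt + " failed: " + response.buildFailureMessage());
        }
        throw new IllegalStateException("Bulk insert failed after " + MAX_RETRIES + " attempts");
    }
}
```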
Sorry for the misunderstanding. By "fail" I mean that my wrapper Hystrix command times out after 15s. I even tried increasing the timeout threshold to 30s. Even then, after a certain number of inserts, some inserts occasionally just keep on waiting.
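For context, the wrapper is roughly the following sketch (command name, group key, and the fallback-free structure are placeholders; the 30s value corresponds to the increased threshold mentioned above, and the exact property setter may differ by Hystrix version):

```java
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

/** Hypothetical Hystrix wrapper around a single bulk insert; it times out if ES does not answer in time. */
public class BulkInsertCommand extends HystrixCommand<Void> {

    private final BulkIndexer indexer;
    private final List<String> batch;

    public BulkInsertCommand(BulkIndexer indexer, List<String> batch) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("es-bulk"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(30_000)));  // raised from 15s to 30s
        this.indexer = indexer;
        this.batch = batch;
    }

    @Override
    protected Void run() {
        indexer.indexBatch(batch);  // blocks until the bulk request returns or Hystrix interrupts it
        return null;
    }
}
```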
See the idle state in the translog size? That's where the entire ES cluster gets busy flushing the translog, which in the meantime causes delays of more than 2 minutes in our bulk inserts. Is translog.disable_flush=true not doing what I expect it to do, or am I misinterpreting its function?
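To correlate the stalls with flush activity, I poll the per-index translog stats while the load runs, roughly like the sketch below (index name is a placeholder and the exact stats getters may vary by ES version):

```java
import org.elasticsearch.action.admin.indices.stats.IndicesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.translog.TranslogStats;

public class TranslogMonitor {

    /** Prints the current translog operation count and size for the given index. */
    public static void printTranslogStats(Client esClient, String indexName) {
        IndicesStatsResponse stats = esClient.admin().indices()
                .prepareStats(indexName)
                .setTranslog(true)   // fetch only translog stats
                .get();
        TranslogStats translog = stats.getTotal().getTranslog();
        System.out.printf("translog: %d ops, %d bytes%n",
                translog.estimatedNumberOfOperations(),
                translog.getTranslogSizeInBytes());
    }
}
```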