Elasticsearch bulk indexing issue

I had an issue where Elasticsearch would become unresponsive after a few bulk index calls. I solved it by setting refresh_interval to -1 before calling the bulk API and setting it back to 1s after I'm done (Strange issue with Elasticsearch while bulk indexing). Even though that solved the responsiveness issue, I later found out there is still another problem: after a few successful bulk index calls, Elasticsearch can't index anything for a good 20 minutes, if not more.

At first, each bulk index call consisted of 10000 x 2 KB documents. Then, in a reply to my last topic, it was pointed out that I should aim for about 5 MB per request, so I changed each call to 1000 x 2 KB documents (2 MB). The problem persisted, so I shrunk each call to 500 x 2 KB. Even though that didn't fix the issue, I noticed a pattern: the less data I sent per bulk request, the longer it took for Elasticsearch to become unresponsive.

I don't know exactly, but I think it's the storage. I am indexing documents to an SSHD, so I suspect that once the solid-state cache runs out, the HDD part of the drive is too slow to do anything and Elasticsearch stays unresponsive until the solid-state cache is freed up. I could easily confirm this theory by running tests on a pure SSD (SATA/NVMe), but unfortunately I don't have an environment where I could do so. If I go out and buy an SSD and the problem turns out to be something else, it would be wasted money, so I'd love a second opinion on this.
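
In case it helps, this is roughly how I toggle refresh_interval around the bulk calls from NEST. The index name and client setup below are placeholders, not my real configuration:

```csharp
using System;
using Nest;

class RefreshIntervalToggle
{
    static void Main()
    {
        // Placeholder client setup.
        var client = new ElasticClient(
            new ConnectionSettings(new Uri("http://domain:9200")));

        // Disable automatic refreshes before bulk indexing ("my-index" is a placeholder).
        client.Indices.UpdateSettings("my-index", u => u
            .IndexSettings(s => s.RefreshInterval(Time.MinusOne)));

        // ... bulk index calls go here ...

        // Restore the 1s refresh interval once indexing is done.
        client.Indices.UpdateSettings("my-index", u => u
            .IndexSettings(s => s.RefreshInterval("1s")));
    }
}
```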

Seeing something similar with our ES 7.4.1. What do you mean by "not responsive"?

Also, are you seeing any 429 errors returned when bulk indexing, meaning your ingestion should slow down? Oddly, we do not.

I am running 7.5.2.

I am using the NEST library for C#. If you look at the function I posted in my previous topic, Strange issue with Elasticsearch while bulk indexing, it has event handlers for onNext, onError, and onCompleted. After I successfully bulk index a few times and this behaviour starts, none of those events gets triggered, no exception gets thrown, and code execution just continues. Kibana stops working because the Elasticsearch response time goes above 30000 ms or some similar number. On top of that, before I started setting refresh_interval to -1, a request to http://domain:9200/index/_count would take several minutes to load; now it loads instantly, but Kibana is still not working. That's the behaviour I call unresponsive. If there is a different term for it, my apologies, English is not my primary language.
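
Roughly, the bulk call looks like this. It's a simplified sketch of what I posted in the other topic; the document type, index name and page size below are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Nest;

class BulkIndexExample
{
    // Placeholder document type; my real documents are around 2 KB each.
    class MyDocument
    {
        public string Id { get; set; }
        public string Payload { get; set; }
    }

    static void Main()
    {
        var client = new ElasticClient(
            new ConnectionSettings(new Uri("http://domain:9200")));

        var bulkAll = client.BulkAll(LoadDocuments(), b => b
            .Index("my-index")            // placeholder index name
            .Size(500)                    // documents per bulk request
            .MaxDegreeOfParallelism(1));  // one bulk request in flight at a time

        var done = new ManualResetEventSlim();

        // These are the handlers that stop firing once the node becomes unresponsive.
        bulkAll.Subscribe(new BulkAllObserver(
            onNext: page => Console.WriteLine($"Indexed page {page.Page}"),
            onError: ex => { Console.WriteLine($"Bulk indexing failed: {ex}"); done.Set(); },
            onCompleted: () => { Console.WriteLine("Bulk indexing completed"); done.Set(); }));

        done.Wait();
    }

    static IEnumerable<MyDocument> LoadDocuments()
    {
        // Placeholder for however the real documents are produced.
        for (var i = 0; i < 500; i++)
            yield return new MyDocument { Id = i.ToString(), Payload = new string('x', 2048) };
    }
}
```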

P.S. What is your server's storage configuration, if you don't mind sharing?

Very similar to what we are seeing (Kibana becoming unresponsive, and any cluster state requests taking a very long time to return).

We are running on SSD direct attached storage.

We see our management thread pool begin to queue up before the failures, but I have a question out about that, as I don't see the management thread pool in the 7.5 documentation.

If you run curl -XGET localhost:9200/_cat/thread_pool?v, do you see management threads?

This is the output :face_with_thermometer: only one management thread, but a lot of rejected write requests (a rough sketch of how I plan to react to those rejections is below the output).

node_name name active queue rejected
nodename analyze 0 0 0
nodename ccr 0 0 0
nodename fetch_shard_started 0 0 0
nodename fetch_shard_store 0 0 0
nodename flush 0 0 0
nodename force_merge 0 0 0
nodename generic 0 0 0
nodename get 0 0 0
nodename listener 0 0 0
nodename management 1 0 0
nodename ml_datafeed 0 0 0
nodename ml_job_comms 0 0 0
nodename ml_utility 0 0 0
nodename refresh 0 0 0
nodename rollup_indexing 0 0 0
nodename search 0 0 0
nodename search_throttled 0 0 0
nodename snapshot 0 0 0
nodename transform_indexing 0 0 0
nodename warmer 0 0 0
nodename watcher 0 0 0
nodename write 0 0 33062
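
Those write rejections should also come back as 429s on the individual bulk items, so I'm thinking of letting BulkAll retry them and log whatever still gets dropped. This is only a sketch of how I assume that would look with NEST's RetryDocumentPredicate and DroppedDocumentCallback; it's not something I'm running yet, and the index name and documents are placeholders:

```csharp
using System;
using System.Linq;
using Nest;

class BulkRejectionHandlingExample
{
    class MyDocument
    {
        public string Id { get; set; }
        public string Payload { get; set; }
    }

    static void Main()
    {
        var client = new ElasticClient(
            new ConnectionSettings(new Uri("http://domain:9200")));

        // Placeholder documents, roughly 2 KB each like mine.
        var documents = Enumerable.Range(0, 500)
            .Select(i => new MyDocument { Id = i.ToString(), Payload = new string('x', 2048) });

        var bulkAll = client.BulkAll(documents, b => b
            .Index("my-index")                                         // placeholder index name
            .Size(500)
            .BackOffRetries(3)                                         // retry a rejected page a few times
            .BackOffTime("30s")                                        // wait between retries so the write queue can drain
            .RetryDocumentPredicate((item, doc) => item.Status == 429) // only retry items rejected with 429
            .DroppedDocumentCallback((item, doc) =>
                Console.WriteLine($"Dropped {doc?.Id}: {item.Error?.Reason}")));

        // Block until all pages are indexed or the maximum run time is exceeded.
        bulkAll.Wait(TimeSpan.FromHours(1), page =>
            Console.WriteLine($"Indexed page {page.Page}"));
    }
}
```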

Interesting to see management threads are still a thing :smiley:

But it looks like this is different from ours. We were hoping to get some rejections so we could back off, but we don't see any. Sorry I don't have more to offer.

It's alright. I will have a system that I'll be running and testing on tomorrow. It has NVMe storage, so if the problem gets solved then it's a storage issue for me; otherwise I will dive into this deeper.

Remember to post back if it works :+1:

If Elasticsearch appears to be too busy to respond to requests, the first thing I would recommend is the nodes hot threads API to find out what it is busy doing: GET /_nodes/hot_threads?threads=99999. Sometimes it gets so stuck that even that doesn't work, in which case you can use jmap to grab a thread dump directly from the JVM, assuming you can work out which nodes need investigation. Also, of course, if there are any messages in the logs from the time of the problem then they would be helpful.

If you need help understanding what the thread dumps or log messages mean, please share them here.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.