We have had ES OSS 7.0 running for more than a couple of months. Twice in the last 15 days, ES froze and became very slow, to the point where it was almost unable to process any request.
Setup: single node, Windows VM, 16 cores, 32 GB RAM, ES heap (Xmx) 4 GB, total ES index data ~2 GB.
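For the heap, the relevant jvm.options lines look like this (shown here with Xms set equal to Xmx, as the ES docs recommend; everything else is left at the defaults):

```
-Xms4g
-Xmx4g
```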
During both of the occurrences, we noticed the following:
The ES logs start showing symptoms of slowness with the following lines:
Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
Failed to update node information for ClusterInfoUpdateJob within 15s timeout
After some time, index requests start timing out. We use the ES High Level REST Client, and its requests start timing out after 30 seconds (a sketch of our indexing code follows this list).
Some time after that, ES starts rejecting requests because the write thread pool queue becomes full (200).
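For reference, this is roughly how our indexing path looks; the host, index name, pipeline name and document are placeholders rather than our real values, and we rely on the client defaults, which is where the 30-second timeout comes from:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // Client built with defaults; the low-level client's default socket timeout
        // is 30s, which matches the timeouts we see.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // One bulk request with a single document; "my-index" and "my-pipeline"
        // are placeholder names.
        BulkRequest bulk = new BulkRequest();
        bulk.add(new IndexRequest("my-index")
                .setPipeline("my-pipeline")
                .source("{\"field\":\"value\"}", XContentType.JSON));

        // During the incidents this call first hits the 30s timeout and later fails
        // with EsRejectedExecutionException once the write queue (200) is full.
        BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
        System.out.println("bulk has failures: " + response.hasFailures());

        client.close();
    }
}
```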
We monitored the machine and ES at the time of the issue:
ES -> All shards are yellow or green; there are no red shards. Querying the hot threads API (/_nodes/hot_threads) times out. Sometimes even the node statistics calls time out or are unresponsive. In the node statistics data we did manage to get, we did not find anything alarming except the rejection count. (The calls we ran are sketched below.)
Windows -> We don't see any high pressure on CPU or memory, and disk I/O also looks fine. Nothing extraordinary here, except some antivirus running.
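The checks we ran against the node while it was in this state are plain GET calls issued through the low-level client; a rough sketch (the host is a placeholder):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class DiagnosticsSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Hot threads: this is the call that times out during the incidents.
        dump(client, "/_nodes/hot_threads");

        // Node stats and write thread pool: the rejected count is the only thing
        // that stands out when these do respond.
        dump(client, "/_nodes/stats/thread_pool");
        dump(client, "/_cat/thread_pool/write?v&h=name,active,queue,rejected");

        client.close();
    }

    private static void dump(RestClient client, String endpoint) throws Exception {
        Response response = client.performRequest(new Request("GET", endpoint));
        System.out.println(endpoint);
        System.out.println(EntityUtils.toString(response.getEntity()));
    }
}
```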
Relevant log lines from the time of the issue:
[WARN ][o.e.c.InternalClusterInfoService] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[ERROR][o.e.a.b.TransportBulkAction] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@13da6e35 on EsThreadPoolExecutor[name = olawpa-ecadm00.ad.garmin.com/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bed0d28[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 1158236]]
We are unable to figure out what could be causing this issue. ES seems to get stuck or become very slow for some reason and is unable to recover even after 2-3 days.
/_cluster/stats output at the time of the issue: ES /_cluster/stats - Pastebin.com
How can we debug this issue further?