Elasticsearch freezes suddenly and is unable to respond

Hi,

We have been running ES OSS 7.0 for more than a couple of months. In the last 15 days, ES has frozen twice and become very slow, almost unable to process any request.

Setup:
Single Node, Windows VM, 16 cores, 32GB RAM, ES Xmx 4G, Total ES indices data ~2GB.

Symptoms:
During both occurrences, we noticed the following:

  1. ES logs start showing slowness symptoms, with lines like the following:
    failed to update shard information for clusterinfoupdatejob within 15s timeout
    failed to update node information for clusterinfoupdatejob within 15s timeout

  2. After some time, index requests start timing out. We use the ES High Level REST Client, which starts timing out after 30 seconds.

  3. After some more time, ES starts rejecting requests as its write queue becomes full (capacity 200).
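When this happens, one quick way to confirm that the write thread pool is the bottleneck is the _cat/thread_pool API (a sketch, assuming ES is reachable on localhost:9200 without authentication):

```shell
# Show the write thread pool's active threads, queue depth, and rejection count.
# Column names are valid for the 7.x _cat API; adjust host/port for your setup.
curl -s "http://localhost:9200/_cat/thread_pool/write?v&h=name,active,queue,rejected,completed"
```

A steadily climbing rejected count with the queue pinned at 200 would match the EsRejectedExecutionException shown in the logs.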

We monitored the machine and ES at the time of the issue:
ES -> All shards are yellow or green; no red shards. Querying the /_nodes/hot_threads API times out. Sometimes even node statistics time out or are unresponsive. Checking the node statistics data, we did not find anything alarming except the rejection count.

Windows -> We don't see any high pressure on CPU or memory, and disk I/O is also quite OK. Nothing extraordinary here, except some antivirus running.
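If the hot_threads API itself times out, a thread dump taken directly from the JVM can still show what the node is busy with. A sketch, assuming ES is on localhost:9200 and the JDK's jps/jstack tools are on the PATH:

```shell
# Try hot_threads with a short client-side timeout first.
curl -s --max-time 15 "http://localhost:9200/_nodes/hot_threads?threads=10" || echo "hot_threads timed out"

# Fall back to a JVM-level thread dump: find the Elasticsearch PID via jps, then dump its stacks.
jstack "$(jps -l | awk '/org.elasticsearch/{print $1}')" > es-threads.txt
```

Comparing a couple of dumps taken a few seconds apart shows which threads are stuck rather than merely busy.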

Logs:

[WARN ][o.e.c.InternalClusterInfoService] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[ERROR][o.e.a.b.TransportBulkAction] failed to execute pipeline for a bulk request

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@13da6e35 on EsThreadPoolExecutor[name = olawpa-ecadm00.ad.garmin.com/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bed0d28[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 1158236]]

We are unable to work out what could cause this issue. It seems that ES is stuck or very slow for some reason and is unable to recover even after 2-3 days.

/_cluster/stats output at the time of issue: ES _cluster/stats - Pastebin.com

How can we further debug on this issue?

How much data do you have in the cluster? How many indices and shards do you have in the cluster?

Hi,

Total data <3GB
# of Indices -> 15, one shard per index
One node in a cluster.

7.0 is long past EOL, please upgrade ASAP.

What is the output from the _cluster/stats?pretty&human API?
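For reference, that output can be captured with (assuming the default localhost:9200 endpoint):

```shell
# Human-readable cluster-wide statistics, including index counts and JVM heap usage.
curl -s "http://localhost:9200/_cluster/stats?pretty&human"
```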

Hi @MarkWalkom, I have updated the question with the _cluster/stats output. I don't see anything out of the ordinary here except the rejected count for indexing. We are planning to upgrade to ES 7.10 but want to make sure that we understand the issue first.

You would be far better off upgrading to 7.14, as it's the latest.

Ok, we will check on that.
Anyway, any insight on this issue?

So many things have been fixed since 7.0. I'd upgrade and see if this fixes the problem you are seeing.

I would recommend temporarily disabling antivirus to see if that has any impact. Have seen some types of antivirus have huge impact on write performance in the past and indexing is quite I/O intensive.

Ok. We will check after upgrading to the latest ES version.
At the time of the issue, I can see from the Windows event log that the antivirus was running. The antivirus ran for 3-4 hours and then stopped.
Is it possible that it permanently slowed down ES, and ES could not recover from it even after 2-3 days?
We needed to restart ES to resolve the issue.