Elasticsearch becomes unresponsive after some time

We are facing an issue with Elasticsearch in our production environment: Elasticsearch intermittently stops responding, and the service has to be restarted in order to recover.

The logs collected from Elasticsearch are as follows.

When the issue occurs, we start seeing the following log entries; after this, all API calls fail:

{ElasticsearchLogger} [o.e.a.a.i.s.TransportIndicesStatsAction] failed to execute [indices:monitor/stats] on node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][indices:monitor/stats[n]] request_id [416253] timed out after [14847ms]
{ElasticsearchLogger} [o.e.c.InternalClusterInfoService] failed to retrieve stats for node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][cluster:monitor/nodes/stats[n]] request_id [416252] timed out after [14847ms]

After this, we saw some health-check failures:

{ElasticsearchLogger} [o.e.m.f.FsHealthService] health check failed
{ElasticsearchLogger} java.lang.IllegalStateException: environment is not locked
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1106) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:865) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:156) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:144) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
{ElasticsearchLogger} at java.lang.Thread.run(Thread.java:829) [?:?]
{ElasticsearchLogger} Caused by: java.io.IOException: An unexpected network error occurred

It shows "An unexpected network error occurred", but we are not sure what that means here.

After this point, Elasticsearch becomes unresponsive and no longer answers API calls.

Currently, we are not sure what is causing the problem.

Earlier, we thought Elasticsearch was becoming unresponsive because of a lack of resources, so we increased the JVM heap size from 12 GB to 30 GB, but we are still facing the same issue.

When the issue occurs, we have sometimes also seen the following error when trying to communicate with Elasticsearch:

http://localhost:9200/_cat/shards?v

{
  "error": {
    "root_cause": [
      { "type": "master_not_discovered_exception", "reason": null }
    ],
    "type": "master_not_discovered_exception", "reason": null
  },
  "status": 503
}
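
If a monitoring script is polling the node, it can help to record exactly which error came back at the moment the node went bad. A minimal sketch, assuming the body has the shape of the response above; `summarize_es_error` is a hypothetical helper name, not part of any Elasticsearch client:

```python
import json


def summarize_es_error(body: str) -> dict:
    """Pull the status and error types out of an Elasticsearch error body."""
    doc = json.loads(body)
    error = doc.get("error")
    # Defensive: tolerate a missing or non-object "error" field.
    if not isinstance(error, dict):
        error = {}
    root_causes = [cause.get("type") for cause in error.get("root_cause", [])]
    return {
        "status": doc.get("status"),
        "type": error.get("type"),
        "root_causes": root_causes,
        # True when the node reports that it has no elected master.
        "no_master": error.get("type") == "master_not_discovered_exception",
    }


# Sample body matching the 503 response quoted in this post.
sample = (
    '{"error": {"root_cause": [{"type": "master_not_discovered_exception", '
    '"reason": null}], "type": "master_not_discovered_exception", '
    '"reason": null}, "status": 503}'
)
summary = summarize_es_error(sample)
print(summary)
```

Logging this summary with a timestamp on every poll would show whether the 503 appears before or after the health-check failures in the Elasticsearch log.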

We need suggestions on how to troubleshoot and resolve this issue.

Production environment details:

  • This is a single node cluster.
  • OS: Windows 2019 Server Standard (full installation) 64-bit
  • Physical Memory (RAM): Total: 127.99 GB
  • CPU: 16 logical CPUs (8 cores in 2 physical packages)
  • Disk where Elasticsearch indices are stored: I:\ Fixed NTFS 1.95 TB 1.29 TB GT2IDX01_ES_INDEX_I
  • Disk where Elasticsearch snapshots are stored: X:\ Fixed NTFS 1.99 TB 259.85 GB GT2IDX01_ES_SNAPVOL

We'd really appreciate any guidance or suggestions on this issue, as the end customer has a significant number of users who are unable to work when these issues occur, and their temperature continues to rise the longer we go without a solution.

Yes, this is the problem: you're using network-attached storage on Windows. See these docs:

Elasticsearch requires the filesystem to act as if it were backed by a local disk, but this means that it will work correctly on properly-configured remote block devices (e.g. a SAN) and remote filesystems (e.g. NFS) as long as the remote storage behaves no differently from local storage.

Windows remote storage behaves sufficiently differently from local storage that you can't reliably use it with Elasticsearch.
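
To gather evidence that the volume itself is stalling, one option is to time a small write-and-fsync on it at regular intervals and log the result next to the Elasticsearch log. This is only a rough sketch and not Elasticsearch's own health check; `probe_write_latency` is a hypothetical name, and the demo below runs against a throwaway temp directory. In practice you would point it at a scratch directory on the index volume, not inside the live data path.

```python
import os
import tempfile
import time


def probe_write_latency(directory: str, payload: bytes = b"probe") -> float:
    """Write, fsync, and delete a small temp file; return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(prefix="es_probe_", dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write through to the storage device
    finally:
        os.close(fd)
        os.remove(path)  # clean up so repeated probes leave nothing behind
    return time.monotonic() - start


# Demo against a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    elapsed = probe_write_latency(demo_dir)
    leftover = os.listdir(demo_dir)

print(f"write+fsync took {elapsed:.3f}s")
```

If the probe raises an `OSError` or takes many seconds around the time the node stops responding, that points at the storage rather than at heap or CPU.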
