Elasticsearch becomes unresponsive after some time

EVINDX · November 30, 2023, 9:14am

We are facing issues with Elasticsearch in our production environment: it intermittently stops responding, and the service has to be restarted to recover.

Logs collected from Elasticsearch are as follows

When the issue occurs, we start seeing the following logs. After this, all API calls fail.

{ElasticsearchLogger} [o.e.a.a.i.s.TransportIndicesStatsAction] failed to execute [indices:monitor/stats] on node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][indices:monitor/stats[n]] request_id [416253] timed out after [14847ms]
{ElasticsearchLogger} [o.e.c.InternalClusterInfoService] failed to retrieve stats for node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][cluster:monitor/nodes/stats[n]] request_id [416252] timed out after [14847ms]

After this, we saw some health-check failures:

{ElasticsearchLogger} [o.e.m.f.FsHealthService] health check failed
{ElasticsearchLogger} java.lang.IllegalStateException: environment is not locked
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1106) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:865) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:156) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:144) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
{ElasticsearchLogger} at java.lang.Thread.run(Thread.java:829) [?:?]
{ElasticsearchLogger} Caused by: java.io.IOException: An unexpected network error occurred

It shows "An unexpected network error occurred", but we are not sure what that means here.

After this point Elasticsearch becomes unresponsive and no longer answers API calls.

Currently, we are not sure what is causing the problem.

Earlier we thought Elasticsearch was becoming unresponsive because of a lack of resources, so we increased the JVM heap size from 12 GB to 30 GB, but we are still facing the same issue.
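To see how often each of these signatures appears and in what order, a small script can tally them across a log file. This is only a sketch: the marker strings are copied from the log lines above, while the marker names and the idea of reading the log from a file path are assumptions for illustration.

```python
from collections import Counter

# Substrings taken from the log excerpts in this post; the keys are
# hypothetical labels chosen for this sketch.
MARKERS = {
    "transport_timeout": "ReceiveTimeoutTransportException",
    "fs_health_failed": "FsHealthService] health check failed",
    "network_io_error": "An unexpected network error occurred",
}


def count_markers(lines):
    """Count how many lines contain each known failure signature."""
    counts = Counter()
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts


def count_markers_in_file(path):
    """Tally signatures in a log file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_markers(handle)
```

If the network I/O error consistently shows up just before the timeouts and the health-check failures, that points at the storage layer rather than at heap or CPU.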

When the issue occurs, we have sometimes also seen the following error when we try to communicate with Elasticsearch:

http://localhost:9200/_cat/shards?v

{
  "error": {
    "root_cause": [
      { "type": "master_not_discovered_exception", "reason": null }
    ],
    "type": "master_not_discovered_exception",
    "reason": null
  },
  "status": 503
}
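For monitoring, it may help to tell this 503 apart from a node that does not answer at all. The sketch below uses only the Python standard library. The URL is the one from this post, but the function names, the timeout, and the returned labels are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


def is_master_not_discovered(body):
    """Return True if an error body reports master_not_discovered_exception."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict):
        return False
    error = doc.get("error")
    if not isinstance(error, dict):
        return False
    return error.get("type") == "master_not_discovered_exception"


def probe(url="http://localhost:9200/_cat/shards?v", timeout=10.0):
    """Return 'ok', 'no_master', 'http_error', or 'unreachable' (labels are hypothetical)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return "ok"
    except urllib.error.HTTPError as exc:
        # Elasticsearch returns the JSON error document in the response body.
        body = exc.read().decode("utf-8", errors="replace")
        return "no_master" if is_master_not_discovered(body) else "http_error"
    except OSError:
        # Covers refused connections and socket timeouts.
        return "unreachable"
```

Calling `probe()` on a schedule and logging the result with a timestamp would show whether the node hangs first and only later reports that it has no master.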

We need suggestions on how to troubleshoot and resolve this issue.

Production environment details:

This is a single node cluster.
OS: Windows 2019 Server Standard (full installation) 64-bit
Physical Memory (RAM): Total: 127.99 GB
CPU: 16 logical CPUs (8 cores in 2 physical packages)
Disk where Elasticsearch indices are stored: I:\ Fixed NTFS 1.95 TB 1.29 TB GT2IDX01_ES_INDEX_I
Disk where Elasticsearch snapshots are stored: X:\ Fixed NTFS 1.99 TB 259.85 GB GT2IDX01_ES_SNAPVOL

EVINDX · November 30, 2023, 5:02pm

We'd really appreciate any guidance/suggestions for this issue as the end customer has a significant number of users that are unable to work when these issues occur and their temperature continues to increase the longer we go without a solution.

DavidTurner · December 20, 2023, 7:34am

Yes, this is the problem: you're using network-attached storage on Windows. See these docs:

Elasticsearch requires the filesystem to act as if it were backed by a local disk, but this means that it will work correctly on properly-configured remote block devices (e.g. a SAN) and remote filesystems (e.g. NFS) as long as the remote storage behaves no differently from local storage.

Windows remote storage behaves sufficiently differently from local storage that you can't reliably use it with Elasticsearch.
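As a rough first check, Windows can be asked how it classifies the volume that holds the data path. The sketch below calls the Win32 `GetDriveTypeW` function through `ctypes`. The helper names are assumptions for illustration, and the result is not conclusive: a mapped network share reports as remote, but a SAN or iSCSI LUN may still report as fixed (the volumes in the original post are both listed as "Fixed"), so a "fixed" answer does not prove the disk is local.

```python
import ctypes
import sys

# Return values of the Win32 GetDriveTypeW function.
DRIVE_TYPES = {
    0: "unknown",
    1: "no_root_dir",
    2: "removable",
    3: "fixed",
    4: "remote",
    5: "cdrom",
    6: "ramdisk",
}


def drive_type_name(code):
    """Map a GetDriveTypeW return code to a readable name."""
    return DRIVE_TYPES.get(code, "unknown")


def drive_type(root):
    """Ask Windows how it classifies a volume root such as 'I:\\'."""
    if sys.platform != "win32":
        raise OSError("GetDriveTypeW is only available on Windows")
    code = ctypes.windll.kernel32.GetDriveTypeW(ctypes.c_wchar_p(root))
    return drive_type_name(code)
```

On the affected server, `drive_type("I:\\")` would give Windows' own view of the index volume. How the volume is actually attached still has to be confirmed with whoever manages the storage.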

