We are facing an issue with Elasticsearch in our production environment: Elasticsearch intermittently stops responding, and the service has to be restarted to recover from this state.
Logs collected from Elasticsearch are as follows.
When the issue occurs, we start seeing the following log entries, after which all API calls fail:
{ElasticsearchLogger} [o.e.a.a.i.s.TransportIndicesStatsAction] failed to execute [indices:monitor/stats] on node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][indices:monitor/stats[n]] request_id [416253] timed out after [14847ms]
{ElasticsearchLogger} [o.e.c.InternalClusterInfoService] failed to retrieve stats for node [JD-iowTKQR-o_5hEjJSNow]
{ElasticsearchLogger} org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][cluster:monitor/nodes/stats[n]] request_id [416252] timed out after [14847ms]
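To see how often these timeouts occur and which actions they hit, a small script along these lines could pull the fields out of each log line. This is only a rough sketch: the regular expression is modelled on the two timeout lines quoted above, and the names `TIMEOUT_RE` and `parse_timeout` are our own, not part of Elasticsearch.

```python
import re
from typing import Optional

# Assumption: the pattern mirrors the shape of the timeout lines quoted above;
# other Elasticsearch versions or log layouts may need adjustments.
TIMEOUT_RE = re.compile(
    r"ReceiveTimeoutTransportException: "
    r"\[(?P<node>[^\]]+)\]\[(?P<address>[^\]]+)\]"
    r"\[(?P<action>[^\[\]]+)(?:\[n\])?\] "
    r"request_id \[(?P<request_id>\d+)\] "
    r"timed out after \[(?P<millis>\d+)ms\]"
)


def parse_timeout(line: str) -> Optional[dict]:
    """Return the fields of a transport timeout log line, or None if it is not one."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be sorted or aggregated.
    fields["request_id"] = int(fields["request_id"])
    fields["millis"] = int(fields["millis"])
    return fields
```

Running every log line through `parse_timeout` and keeping the non-`None` results gives a list that can be grouped by `action` to check whether the timeouts cluster around particular times.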
After this we saw some failures related to the health check:
{ElasticsearchLogger} [o.e.m.f.FsHealthService] health check failed
{ElasticsearchLogger} java.lang.IllegalStateException: environment is not locked
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1106) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:865) ~[elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:156) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:144) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
{ElasticsearchLogger} at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
{ElasticsearchLogger} at java.lang.Thread.run(Thread.java:829) [?:?]
{ElasticsearchLogger} Caused by: java.io.IOException: An unexpected network error occurred
The log shows "An unexpected network error occurred", but we are not sure what that means here.
After this point Elasticsearch becomes unresponsive and no longer answers API calls.
Currently, we are not sure what is causing the problem.
Earlier we thought Elasticsearch was becoming unresponsive because of a lack of resources, so we increased the JVM heap size from 12 GB to 30 GB, but we are still facing the same issue.
When the issue occurs, we have sometimes also seen the following error when trying to communicate with Elasticsearch:
http://localhost:9200/_cat/shards?v
{
  "error": {
    "root_cause": [
      { "type": "master_not_discovered_exception", "reason": null }
    ],
    "type": "master_not_discovered_exception",
    "reason": null
  },
  "status": 503
}
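To record exactly when the node stops answering, we are considering a simple external probe like the one below. It is a minimal sketch using only the Python standard library; the function names `classify` and `probe`, the returned labels, and the 10-second timeout are our own choices, and the URL is the one shown above.

```python
import json
import urllib.error
import urllib.request


def classify(status: int, body) -> str:
    """Map an HTTP status and response body to a short label."""
    if status == 200:
        return "ok"
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        return f"http_{status}"
    if isinstance(error, dict) and error.get("type") == "master_not_discovered_exception":
        return "no_master"
    return f"http_{status}"


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Call the URL once and describe the outcome without raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return classify(resp.status, resp.read())
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a body worth inspecting.
        return classify(exc.code, exc.read())
    except OSError:
        # Connection refused, timeout, DNS failure and similar.
        return "unreachable"
```

Calling `probe("http://localhost:9200/_cat/shards?v")` from a scheduled task and logging the result with a timestamp would let us line up the first `no_master` or `unreachable` result with the Elasticsearch log entries.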
We need suggestions on how to troubleshoot and resolve this issue.
Production environment details:
- This is a single node cluster.
- OS: Windows Server 2019 Standard (full installation) 64-bit
- Physical Memory (RAM): Total: 127.99 GB
- CPU: 16 logical CPUs (8 cores in 2 physical packages)
- Disk where Elasticsearch indices are stored: I:\ Fixed NTFS 1.95 TB 1.29 TB GT2IDX01_ES_INDEX_I
- Disk where Elasticsearch snapshots are stored: X:\ Fixed NTFS 1.99 TB 259.85 GB GT2IDX01_ES_SNAPVOL