Elasticsearch throws error 503 Service Unavailable

I am facing an issue with Elasticsearch in our production environment.
Elasticsearch stops responding to API calls and has to be restarted.

The logs collected from Elasticsearch are as follows.

When the issue occurs, we start seeing the following logs. After this, all API calls fail:

[o.e.a.a.i.s.TransportIndicesStatsAction] failed to execute [indices:monitor/stats] on node [JD-iowTKQR-o_5hEjJSNow]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][indices:monitor/stats[n]] request_id [416253] timed out after [14847ms]
[o.e.c.InternalClusterInfoService] failed to retrieve stats for node [JD-iowTKQR-o_5hEjJSNow]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [GT-EVAULTIDX01][127.0.0.1:9201][cluster:monitor/nodes/stats[n]] request_id [416252] timed out after [14847ms]

I am also seeing some IOExceptions that say "An unexpected network error occurred".
Does that mean Elasticsearch is not working as expected because of a network issue?

[o.e.m.f.FsHealthService] health check failed
java.lang.IllegalStateException: environment is not locked
at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1106) ~[elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.env.NodeEnvironment.nodeDataPaths(NodeEnvironment.java:865) ~[elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:156) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:144) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.3.jar:7.16.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: An unexpected network error occurred

After this point Elasticsearch becomes unresponsive and throws exceptions in response to API calls.

I am seeing the following logs multiple times:

[r.suppressed] path: /_cat/nodes, params: {master_timeout=30s}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.3.jar:7.16.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.3.jar:7.16.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
[o.e.i.b.request] [request] Adjusted breaker by [16440] bytes, now [16440]
[o.e.i.b.in_flight_requests] [in_flight_requests] Adjusted breaker by [0] bytes, now [0]
[o.e.h.HttpTracer] [37512][null][SERVICE_UNAVAILABLE][application/json; charset=UTF-8][151] sent response to [Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:18391}] success [true]

Elasticsearch returns a 503 Service Unavailable error for any API call, for example:

http://localhost:9200/_cat/shards?v

{
  "error": {
    "root_cause": [
      { "type": "master_not_discovered_exception", "reason": null }
    ],
    "type": "master_not_discovered_exception", "reason": null
  },
  "status": 503
}
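
For anyone scripting around this, here is a minimal Python sketch of how a client could recognise this particular failure from the response body. The function name `is_master_not_discovered` and the `sample` string are illustrative only; the field names are taken from the response shown in this post.

```python
import json


def is_master_not_discovered(body: str) -> bool:
    """Return True if an error body reports a 503 with master_not_discovered_exception."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all, so it cannot be the error we are looking for.
        return False
    if not isinstance(doc, dict):
        return False
    error = doc.get("error")
    if not isinstance(error, dict):
        return False
    return (
        doc.get("status") == 503
        and error.get("type") == "master_not_discovered_exception"
    )


# Sample body with the same fields as the response quoted in this thread.
sample = (
    '{"error": {"root_cause": [{"type": "master_not_discovered_exception", '
    '"reason": null}], "type": "master_not_discovered_exception", '
    '"reason": null}, "status": 503}'
)
print(is_master_not_discovered(sample))
```

A check like this only tells you that the node cannot see an elected master; it says nothing about why.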

Currently, I am not sure what is causing the problem in Elasticsearch.

I need suggestions on how I can troubleshoot and resolve this issue.
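
As a starting point for troubleshooting, one rough approach is to find where each symptom first appears in the log, so the order of events is clear. The sketch below is an assumption-laden illustration: the patterns are copied from the log excerpts in this post, while the names `SYMPTOMS` and `first_occurrences` are made up for the example.

```python
import re
from typing import Dict, Iterable

# Symptom patterns copied from the log excerpts quoted in this thread.
SYMPTOMS = {
    "storage_io_error": re.compile(
        r"java\.io\.IOException: An unexpected network error occurred"
    ),
    "fs_health_failed": re.compile(r"FsHealthService\] health check failed"),
    "stats_timeout": re.compile(r"ReceiveTimeoutTransportException"),
    "no_master": re.compile(r"MasterNotDiscoveredException"),
}


def first_occurrences(lines: Iterable[str]) -> Dict[str, int]:
    """Map each symptom name to the 1-based line number of its first match."""
    seen: Dict[str, int] = {}
    for number, line in enumerate(lines, start=1):
        for name, pattern in SYMPTOMS.items():
            if name not in seen and pattern.search(line):
                seen[name] = number
    return seen
```

To use it, pass an open log file (the log location depends on the installation) and sort the result by line number, e.g. `sorted(first_occurrences(handle).items(), key=lambda item: item[1])`.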

What is the size and configuration of the cluster?

This suggests three things to me:

  1. You are running on Windows,
  2. The Elasticsearch data path is on some kind of network-attached storage, and
  3. This network-attached storage is not working reliably enough.

Does that sound plausible?


See these docs for more information:

Elasticsearch requires the filesystem to act as if it were backed by a local disk, but this means that it will work correctly on properly-configured remote block devices (e.g. a SAN) and remote filesystems (e.g. NFS) as long as the remote storage behaves no differently from local storage.

In particular, if you are using a remote storage device and that device returns a network error then that is not "behaves no differently from local storage", so we wouldn't expect Elasticsearch to work correctly in this situation.

Thanks David for the response.

I am checking the network and storage and trying to figure out what is going wrong in the environment.

I am trying to understand which parameters I can check to make sure that the remote storage is behaving no differently from local storage.

It would be helpful if you could tell me what I should monitor on the network and storage side.

I'm not familiar with remote storage in Windows so I can't really answer that. But if your storage generates an error saying "An unexpected network error occurred" then it's not behaving like local storage.

Understood.
Could you please elaborate on "it's not behaving like local storage"?
What is expected of any storage for it to behave like local storage?

Local storage means something like a hard disk directly attached to your system. Such a setup would never experience an unexpected network error since there is no network on the path to the storage device.
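
One rough way to watch for this from outside Elasticsearch is to periodically write, sync, read back, and delete a small file on the same storage, and record any I/O error or unusual latency. The Python sketch below assumes a hypothetical `probe_path` helper and a `.storage_probe_` file prefix; it is an independent illustration, not Elasticsearch's own health check, and it should be pointed at a scratch directory on the same volume rather than inside the live data directory.

```python
import os
import tempfile
import time
from typing import Optional, Tuple


def probe_path(
    directory: str, payload: bytes = b"probe"
) -> Tuple[bool, float, Optional[str]]:
    """Write, fsync, read back, and delete a small temp file in directory.

    Returns (ok, elapsed_seconds, error_message).
    """
    start = time.monotonic()
    tmp_name = None
    try:
        fd, tmp_name = tempfile.mkstemp(prefix=".storage_probe_", dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the storage device
        finally:
            os.close(fd)
        with open(tmp_name, "rb") as handle:
            if handle.read() != payload:
                return False, time.monotonic() - start, "read-back mismatch"
        return True, time.monotonic() - start, None
    except OSError as exc:
        # Any I/O failure (including network errors from remote storage) lands here.
        return False, time.monotonic() - start, str(exc)
    finally:
        if tmp_name is not None:
            try:
                os.remove(tmp_name)
            except OSError:
                pass  # best-effort cleanup
```

Running this on a schedule and logging the three returned values gives a timeline that can be compared with the Elasticsearch log: on storage that behaves like a local disk, the probe should never report an error.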
