Hi! I need help with the following scenario: I'm running Elasticsearch in cluster mode on 3 worker nodes in RKE. It suddenly stopped working and Kibana went down, as Elasticsearch returns a 503 Service Unavailable error. The Elasticsearch service has a persistent volume, but reading the logs I'm unable to understand what it is asking for in order to get up and running again.
You need to share the logs; without them it is impossible to know what the issue could be.
Thanks for your reply:
Node 1 and Node 2, last tail -100:
{"type": "server", "timestamp": "2023-06-09T20:42:54,301Z", "level": "WARN", "component": "o.e.x.m.MonitoringService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "monitoring execution failed", "cluster.uuid": "XXXXXXXXXX", "node.id": "XXXXXXXXXX" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks",
"at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:142) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:117) [x-pack-monitoring-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onResponse(ActionListener.java:389) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:625) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:620) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:97) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:1041) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:818) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction.runReroutePhase(TransportReplicationAction.java:227) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:222) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.replication.TransportReplicationAction.doExecute(TransportReplicationAction.java:85) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:53) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:154) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:590) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:736) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction.doInternalExecute(TransportBulkAction.java:279) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction$3.doRun(TransportBulkAction.java:814) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:288) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.action.bulk.TransportBulkAction.lambda$processBulkIndexIngestRequest$4(TransportBulkAction.java:829) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.ingest.IngestService.lambda$executePipelines$3(IngestService.java:751) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.ingest.IngestService.innerExecute(IngestService.java:833) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:700) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.ingest.IngestService.access$000(IngestService.java:82) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.ingest.IngestService$3.doRun(IngestService.java:662) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.3.jar:7.17.3]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.3.jar:7.17.3]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]",
"Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:131) ~[?:?]",
Node 3's log, however:
{"type": "server", "timestamp": "2023-06-09T21:09:25,631Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "high disk watermark [90%] exceeded on [xxxxxxxxxxPWLLuw][elasticsearch-master-0][/usr/share/elasticsearch/data/nodes/0] free: 55.9gb[9.9%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete", "cluster.uuid": "xxxxxxxxezfMr6l7uzOA", "node.id": "xxxxxxV7egJQ" }
Can you format your logs using the Preformatted text option, the </> button? It makes them easier to read.
The logs you shared from Node 1 and Node 2 are not helpful; do you have any ERROR or FATAL log lines? This is just a WARN.
If your cluster is not running you would have an ERROR or FATAL log line.
This log means that this node does not have enough disk space to work correctly and is trying to move shards away to other nodes to free up some space, but it seems that it can't.
Do all your nodes have the same disk space? Can you increase the disk space? Elasticsearch has watermarks that trigger when a node reaches 85%, 90% and 95% of disk usage.
If the nodes do not have enough space this will impact the cluster and, if I'm not wrong, can lead to the issue you are getting.
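If you can still reach the cluster, something like this should show how full each node's data path is from Elasticsearch's point of view; the host and port here are just an assumption, adjust them to however you reach your service:

```
# Per-node disk usage as Elasticsearch sees it (host/port assumed)
curl -s 'http://localhost:9200/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail,disk.total'

# Overall cluster status
curl -s 'http://localhost:9200/_cluster/health?pretty'
```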
You mentioned that you are using RKE. I have no experience with Elasticsearch on k8s, but what is giving you the 503 error? Do you have anything in front of your Elasticsearch endpoint, some kind of ingress like nginx? Does RKE run any health check to see if the Elasticsearch service is running? If yes, how does it check?
Can you connect directly to one of the pods and run a curl against the Elasticsearch service?
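Something like this, for example; it is only a sketch, as I don't use k8s myself, and the pod name is a guess based on the node names in your logs:

```
# Pod name assumed from the node names in your logs
kubectl exec -it elasticsearch-master-0 -- \
  curl -s 'http://localhost:9200/_cluster/health?pretty'
```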
Yes thanks.
Do all your nodes have the same disk space?
Yes, the 3 worker nodes are the same and they share the persistent volume for Elasticsearch.
Can you increase the disk space? Yes, I could, but it's not unlimited; how could I know how much I need so that the service can keep processing?
What is giving you the 503 error? My Elasticsearch service URL, http…. :9200; all the curls are returning Service Unavailable.
Yes, there is an nginx-like ingress in front of Elasticsearch.
No, the only check is the readiness probe for green status, plus timeouts.
These are the error logs for node 2; the rest of the nodes are not writing anything at the moment:
{"type": "server", "timestamp": "2023-06-09T19:46:56,418Z", "level": "ERROR", "component": "o.e.x.i.h.ILMHistoryStore", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "failed to index 8 items into ILM history index", "cluster.uuid": "xxxxxxxxxxxxxxxxxxxxxxzOA", "node.id": "xxxxxxxxxxxxxxxxxxxxxV7egJQ" }
{"type": "server", "timestamp": "2023-06-09T19:50:19,189Z", "level": "ERROR", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "[-------uat][0] unexpected failure while failing shard [shard id [[----------------uat][0]], allocation id [xxxxxxxxxxxx-LTXBA], primary term [0], message [failed recovery], failure [RecoveryFailedException[[xxxxxxxxxxxxxxxxxx-uat][0]: Recovery failed from {elasticsearch-master-0}{xxxxxxxuw}{xxxxxxxxxxxT9GU9YcWg}{192.168.191.19}{192.168.191.19:9300}{cdfhilmrstw}{ml.machine_memory=2147483648, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {elasticsearch-master-1}{xxxxxxxxbWAQ}{xxxxxxxxxxgxgl6dE6g}{192.168.102.181}{192.168.102.181:9300}{cdfhilmrstw}{ml.machine_memory=2147483648, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}]; nested: RemoteTransportException[[elasticsearch-master-0][192.168.191.19:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elasticsearch-master-1][192.168.102.181:9300][internal:index/shard/recovery/file_chunk]]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ], markAsStale [true]]", "cluster.uuid": "xxxxxxxxxxxxxx7uzOA", "node.id": "xxxxxxxxxxxxxxxxxxxxxV7egJQ" }
You need to increase the disk space to be able to bring your cluster up again.
After your cluster is running you will need to check your indices, see what is taking the most space, and decide what you can delete.
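A rough way to see what is using the space; the host/port and the index name in the DELETE are placeholders, not real values from your cluster:

```
# List indices sorted by size on disk, largest first
curl -s 'http://localhost:9200/_cat/indices?v&s=store.size:desc&h=index,health,docs.count,store.size'

# Delete an index you have decided you no longer need (placeholder name)
curl -X DELETE 'http://localhost:9200/some-old-index-2023.01.01'
```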
To avoid getting into this issue again you will need an Index Lifecycle Policy that deletes your indices after a pre-determined time; what that time is and how you configure it depends entirely on what you are indexing and how.
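Just as an example, a minimal delete-only policy could look like the sketch below; the policy name and the 30-day retention are made-up values, and you would still need to attach the policy to your indices or index templates:

```
curl -X PUT 'http://localhost:9200/_ilm/policy/delete-after-30d' \
  -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```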
Elasticsearch has three watermarks: low, high and flood. By default they trigger at 85%, 90% and 95% of disk usage; you can change them to be more efficient, but you need to have these watermarks configured.
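These are the settings behind those watermarks; the values below are just the defaults, shown as an example of how you would change them through the cluster settings API:

```
curl -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```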
What do you mean by that? Can you share any config related to this? Each node needs to have its own data directory.
Thanks, ok, I'll add more space. In order to add more space, do I need to stop Logstash and Kibana? Also, while I add the space, do I need to scale my Elasticsearch pods to 0, I mean restart Elasticsearch?
Currently I’m at 82% used
I have 3 RHEL nodes with the same specs, running with one persistent volume shared between them that Elasticsearch uses to store its data.
What I read about Elasticsearch running in RKE with 3 worker nodes and one persistent volume is the following:
Shard Distribution:
With a shared persistent volume, Elasticsearch's shard allocation process will focus on factors other than storage capacity to distribute shards across worker nodes:
a. Node Availability: The shard allocation process takes into account the availability of the worker nodes. It ensures that shards are allocated to nodes that are up and running, excluding any nodes that are unavailable or undergoing maintenance.
b. Shard Balance: Elasticsearch aims to achieve a balanced distribution of shards across worker nodes to evenly distribute the data and workload. It considers the number of shards on each node and attempts to distribute shards proportionally based on the current distribution.
c. Resource Utilization: The shard allocation process also considers the resource utilization of each worker node. It evaluates the CPU, memory, and disk usage on the nodes to prevent overloading any specific node. If a node is already heavily utilized, the shard allocation process may prefer allocating shards to less utilized nodes to maintain a balanced resource distribution.
Since the shared persistent volume provides access to the same underlying storage resources for all worker nodes, the shard allocation process doesn't need to consider storage capacity as a primary factor for shard distribution.
You should stop both Logstash and Kibana, and start them after the cluster is up again.
Not sure, I do not use Elasticsearch on k8s, so I do not know what will happen if you scale it to 0. You just need to stop the pods, increase the space and start them again.
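It would probably be something along these lines, though I can't confirm it myself; the StatefulSet name is only a guess based on your pod names, so check what your chart actually deployed:

```
# Resource names are assumptions, verify them first
kubectl get statefulsets
kubectl scale statefulset elasticsearch-master --replicas=0
# ...increase the persistent volume / underlying storage...
kubectl scale statefulset elasticsearch-master --replicas=3
```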
No idea if this is correct, as I only use Elastic on traditional VMs and do not know how this shared persistent volume works, but you need to make sure that you won't hit the watermarks; since the underlying storage is shared, all your nodes will probably hit the watermark at the same time, and this can be problematic.
Thanks for your help. I increased the space of the persistent volume, left it to process, and now the Elasticsearch service is up and running and Kibana is back as well.
As far as I am aware you should not run multiple Elasticsearch nodes off a shared persistent volume. Each node should have its own persistent volume.