All shards failed

Hello,

I am running the Elastic Stack (Beats -> Logstash -> Elasticsearch -> Kibana).
It is a single server (no cluster) collecting data from 10 or so servers running Auditbeat and Filebeat. Today it stopped working, and I cannot find the cause. When I restart Elasticsearch, I see the following in the log:

[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-deprecation]
[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-enrich]
[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-eql]
[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-graph]
[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-identity-provider]
[2020-07-28T12:53:33,449][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-ilm]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-logstash]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-ml]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-monitoring]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-ql]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-rollup]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-security]
[2020-07-28T12:53:33,450][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-sql]
[2020-07-28T12:53:33,451][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-voting-only-node]
[2020-07-28T12:53:33,452][INFO ][o.e.p.PluginsService     ] [kibana] loaded module [x-pack-watcher]
[2020-07-28T12:53:33,455][INFO ][o.e.p.PluginsService     ] [kibana] no plugins loaded
[2020-07-28T12:53:33,555][INFO ][o.e.e.NodeEnvironment    ] [kibana] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/sdc1)]], net usable_space [397gb], net total_space [1006.9gb], types [ext4]
[2020-07-28T12:53:33,555][INFO ][o.e.e.NodeEnvironment    ] [kibana] heap size [16gb], compressed ordinary object pointers [true]
[2020-07-28T12:53:34,869][INFO ][o.e.n.Node               ] [kibana] node name [kibana], node ID [xJIkc_wTSTyOC8i-c4BFaw], cluster name [elasticsearch]
[2020-07-28T12:53:41,277][INFO ][o.e.x.s.a.s.FileRolesStore] [kibana] parsed [0] roles from file [/etc/elasticsearch/roles.yml]
[2020-07-28T12:53:42,669][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [kibana] [controller/4055] [Main.cc@110] controller (64 bit): Version 7.8.0 (Build 58ff6912e20047) Copyright (c) 2020 Elasticsearch BV
[2020-07-28T12:53:43,488][INFO ][o.e.d.DiscoveryModule    ] [kibana] using discovery type [zen] and seed hosts providers [settings]
[2020-07-28T12:53:44,717][INFO ][o.e.n.Node               ] [kibana] initialized
[2020-07-28T12:53:44,722][INFO ][o.e.n.Node               ] [kibana] starting ...
[2020-07-28T12:53:45,069][INFO ][o.e.t.TransportService   ] [kibana] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2020-07-28T12:53:47,732][WARN ][o.e.b.BootstrapChecks    ] [kibana] the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
[2020-07-28T12:53:47,733][INFO ][o.e.c.c.Coordinator      ] [kibana] cluster UUID [cg50k59xTgyfZVKvbKtx-w]
[2020-07-28T12:53:47,742][INFO ][o.e.c.c.ClusterBootstrapService] [kibana] no discovery configuration found, will perform best-effort cluster bootstrapping after [3s] unless existing master is discovered
[2020-07-28T12:53:48,090][INFO ][o.e.c.s.MasterService    ] [kibana] elected-as-master ([1] nodes joined)[{kibana}{xJIkc_wTSTyOC8i-c4BFaw}{wc0JqiLkTRGBOVAfaYN8hA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=29479157760, xpack.installed=true, transform.node=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 107, version: 43098, delta: master node changed {previous [], current [{kibana}{xJIkc_wTSTyOC8i-c4BFaw}{wc0JqiLkTRGBOVAfaYN8hA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=29479157760, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]}
[2020-07-28T12:53:48,816][INFO ][o.e.c.s.ClusterApplierService] [kibana] master node changed {previous [], current [{kibana}{xJIkc_wTSTyOC8i-c4BFaw}{wc0JqiLkTRGBOVAfaYN8hA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=29479157760, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]}, term: 107, version: 43098, reason: Publication{term=107, version=43098}
[2020-07-28T12:53:48,947][INFO ][o.e.h.AbstractHttpServerTransport] [kibana] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2020-07-28T12:53:48,947][INFO ][o.e.n.Node               ] [kibana] started
[2020-07-28T12:53:50,411][INFO ][o.e.c.s.ClusterSettings  ] [kibana] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2020-07-28T12:53:51,651][INFO ][o.e.l.LicenseService     ] [kibana] license [c85a0dd5-abe3-4108-b556-b7fa4726d94b] mode [basic] - valid
[2020-07-28T12:53:51,652][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [kibana] Active license is now [BASIC]; Security is disabled
[2020-07-28T12:53:51,682][INFO ][o.e.g.GatewayService     ] [kibana] recovered [179] indices into cluster_state
[2020-07-28T12:53:51,974][WARN ][r.suppressed             ] [kibana] path: /.kibana/_count, params: {index=.kibana}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:582) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.8.0.jar:7.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-07-28T12:53:51,974][WARN ][r.suppressed             ] [kibana] path: /.kibana_task_manager/_count, params: {index=.kibana_task_manager}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:582) [elasticsearch-7.8.0.jar:7.8.0]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.8.0.jar:7.8.0]

What is in the Elasticsearch logs?

What does the output from _cat/indices/.kibana?v show?
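For reference, something like this against the local node should show it (the address assumes the default 127.0.0.1:9200 bind from your startup log):

curl -s 'http://localhost:9200/_cat/indices/.kibana?v'

The health and status columns in that output will tell us whether the .kibana shards are actually assigned.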

Hello,

Thank you for the response. The log I sent was from the Elasticsearch logs when the service restarts. Now, the following messages appear:

[2020-07-29T00:01:50,326][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4380] overhead, spent [10.2s] collecting in the last [10.7s]
[2020-07-29T00:02:00,834][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4381][3773] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.6h], memory [15.5gb]->[15.6gb]/[16gb], all_pools {[young] [0b]->[24mb]/[0b]}{[old] [15.5gb]->[15.5gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:00,834][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4381] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:02:11,580][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4382][3774] duration [10.2s], collections [1]/[10.7s], total [10.2s]/[10.6h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [24mb]->[8mb]/[0b]}{[old] [15.5gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:11,580][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4382] overhead, spent [10.2s] collecting in the last [10.7s]
[2020-07-29T00:02:11,594][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [kibana] collector [index_recovery] timed out when collecting data
[2020-07-29T00:02:21,911][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4383][3775] duration [10.1s], collections [1]/[10.3s], total [10.1s]/[10.6h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [8mb]->[0b]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:21,915][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4383] overhead, spent [10.1s] collecting in the last [10.3s]
[2020-07-29T00:02:21,920][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [kibana] collector [cluster_stats] timed out when collecting data
[2020-07-29T00:02:32,354][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4384][3776] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.6h], memory [15.6gb]->[15.5gb]/[16gb], all_pools {[young] [0b]->[0b]/[0b]}{[old] [15.6gb]->[15.5gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:32,355][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4384] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:02:32,369][ERROR][o.e.x.m.c.n.NodeStatsCollector] [kibana] collector [node_stats] timed out when collecting data
[2020-07-29T00:02:42,807][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4385][3777] duration [10.2s], collections [1]/[10.4s], total [10.2s]/[10.6h], memory [15.5gb]->[15.5gb]/[16gb], all_pools {[young] [0b]->[0b]/[0b]}{[old] [15.5gb]->[15.5gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:42,807][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4385] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:02:42,813][ERROR][o.e.x.m.c.i.IndexStatsCollector] [kibana] collector [index-stats] timed out when collecting data
[2020-07-29T00:02:53,217][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4386][3778] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.6h], memory [15.5gb]->[15.6gb]/[16gb], all_pools {[young] [0b]->[16mb]/[0b]}{[old] [15.5gb]->[15.5gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:02:53,217][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4386] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:03:03,616][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4387][3779] duration [10.2s], collections [1]/[10.3s], total [10.2s]/[10.6h], memory [15.6gb]->[15.5gb]/[16gb], all_pools {[young] [16mb]->[0b]/[0b]}{[old] [15.5gb]->[15.5gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:03,616][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4387] overhead, spent [10.2s] collecting in the last [10.3s]
[2020-07-29T00:03:14,456][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4388][3780] duration [10.1s], collections [1]/[10.8s], total [10.1s]/[10.6h], memory [15.5gb]->[15.6gb]/[16gb], all_pools {[young] [0b]->[0b]/[0b]}{[old] [15.5gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:14,456][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4388] overhead, spent [10.2s] collecting in the last [10.8s]
[2020-07-29T00:03:25,682][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [kibana] collector [index_recovery] timed out when collecting data
[2020-07-29T00:03:25,682][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4390][3781] duration [10.1s], collections [1]/[10.2s], total [10.1s]/[10.6h], memory [15.9gb]->[15.7gb]/[16gb], all_pools {[young] [304mb]->[8mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:25,682][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4390] overhead, spent [10.1s] collecting in the last [10.2s]
[2020-07-29T00:03:36,087][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [kibana] collector [cluster_stats] timed out when collecting data
[2020-07-29T00:03:36,088][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4391][3782] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.6h], memory [15.7gb]->[15.6gb]/[16gb], all_pools {[young] [8mb]->[24mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:36,089][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4391] overhead, spent [10.1s] collecting in the last [10.4s]
[2020-07-29T00:03:46,527][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4392][3783] duration [10.2s], collections [1]/[10.4s], total [10.2s]/[10.6h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [24mb]->[8mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:46,527][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4392] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:03:46,531][ERROR][o.e.x.m.c.n.NodeStatsCollector] [kibana] collector [node_stats] timed out when collecting data
[2020-07-29T00:03:56,849][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4393][3784] duration [10.1s], collections [1]/[10.3s], total [10.1s]/[10.6h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [8mb]->[0b]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:03:56,849][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4393] overhead, spent [10.1s] collecting in the last [10.3s]
[2020-07-29T00:03:56,851][ERROR][o.e.x.m.c.i.IndexStatsCollector] [kibana] collector [index-stats] timed out when collecting data
[2020-07-29T00:04:07,267][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4394][3785] duration [10.2s], collections [1]/[10.4s], total [10.2s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [0b]->[16mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:04:07,267][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4394] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:04:17,707][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4395][3786] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [16mb]->[8mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:04:17,707][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4395] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:04:28,049][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4396][3787] duration [10.1s], collections [1]/[10.3s], total [10.1s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [8mb]->[40mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:04:28,049][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4396] overhead, spent [10.1s] collecting in the last [10.3s]
[2020-07-29T00:04:38,434][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4397][3788] duration [10.1s], collections [1]/[10.3s], total [10.1s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [40mb]->[32mb]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:04:38,434][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4397] overhead, spent [10.2s] collecting in the last [10.3s]
[2020-07-29T00:04:48,842][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4398][3789] duration [10.2s], collections [1]/[10.4s], total [10.2s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [32mb]->[0b]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}
[2020-07-29T00:04:48,842][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][4398] overhead, spent [10.2s] collecting in the last [10.4s]
[2020-07-29T00:04:59,315][WARN ][o.e.m.j.JvmGcMonitorService] [kibana] [gc][old][4399][3790] duration [10.1s], collections [1]/[10.4s], total [10.1s]/[10.7h], memory [15.6gb]->[15.6gb]/[16gb], all_pools {[young] [0b]->[0b]/[0b]}{[old] [15.6gb]->[15.6gb]/[16gb]}{[survivor] [0b]->[0b]/[0b]}

Hello, thank you for answering. When I try to curl the Elasticsearch node, it keeps waiting and I don't receive any response. In case it is helpful, Kibana also shows the "all shards failed" error when I access it.
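(For context, this is roughly what I am running, with the URL being my approximation of the default local address:

curl 'http://localhost:9200/'

It produces no output and never returns.)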

Elasticsearch is suffering from heap pressure, so I would recommend either increasing RAM/heap or expanding the cluster. You could also delete some data. What is the full output of the cluster stats API?
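Something like this should fetch it, again assuming the default local address:

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'

The nodes.jvm section of that output shows heap used versus heap max, which is what matters here.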

Sorry for the late reply. When I curl the port, I get no response; it keeps waiting forever.

Then it seems to be in pretty bad shape. Can you try stopping indexing and restarting it?
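A rough sketch of what I mean, assuming the services are systemd-managed package installs with the usual unit names:

# stop the ingest side so nothing is indexing
sudo systemctl stop logstash
# on the monitored servers:
sudo systemctl stop filebeat auditbeat
# then restart Elasticsearch itself
sudo systemctl restart elasticsearch

Once the node is healthy again, start the pipeline back up in reverse order.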

Hello,

Thank you for the responses. I increased the RAM and then the heap, and it started working properly.
Thank you all for the advice!
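In case it helps anyone later: the heap change was just the standard JVM options (the path matches the /etc/elasticsearch config directory from my logs; the values are what fit my new RAM, not a general recommendation):

# /etc/elasticsearch/jvm.options
-Xms24g
-Xmx24g

The usual guidance is to keep -Xms and -Xmx equal and no more than about half of the machine's physical RAM.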

