Single node troubleshooting help required

Ian_Miller · September 20, 2019, 1:23pm

Hi,

We've had a single node running Logstash, Elasticsearch and Kibana for about six months. Although it wasn't intended to be a production system, it has become essential for troubleshooting: we use it to capture firewall logs via syslog.

It was working fine, but at some point in the last few weeks Kibana started timing out (or sometimes returning a 500 error).

In /var/log/elasticsearch//elasticsearch.log I'm seeing stuff like this:

[2019-09-20T10:00:42,499][DEBUG][o.e.a.s.TransportSearchAction] [h_iLag0] All shards failed for phase: [query]
[2019-09-20T10:00:42,500][WARN ][r.suppressed ] [h_iLag0] path: /.kibana_task_manager/_doc/_search, params: {ignore_unavailable=true, index=.kibana_task_manager, type=_doc}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:133) ~[elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:100) ~[elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$1(InitialSearchPhase.java:208) ~[elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.2.jar:6.8.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
[2019-09-20T10:00:44,393][WARN ][o.e.m.j.JvmGcMonitorService] [h_iLag0] [gc][58] overhead, spent [1.5s] collecting in the last [1.6s]

I've no idea where to start trying to understand this. Can anyone give me some pointers? I've already tried simple things like checking disk space and system load, and rebooting.

We're running 6.8.2 on Debian.

Janko · September 20, 2019, 7:49pm

Hi Ian and welcome to the forum!

You probably have too many shards in the cluster by now, especially if you are using the default of 5 primary shards per index that older versions set. Ideally there should be fewer than 20 shards per GB of configured JVM heap (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster). Running all the components on the same single node is also not recommended: it leads to resource contention, off-heap memory (for example) cannot be used as expected, and our recommendations assume Elasticsearch has the node to itself.
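To make the 20-shards-per-GB rule of thumb concrete, here is a small sketch of the arithmetic. The figure of 160 daily indices is purely illustrative and is not taken from the thread. Replicas are ignored because Elasticsearch never allocates a replica on the same node as its primary, so on a single node they stay unassigned.

```python
# Rule of thumb from the blog post linked above: stay below ~20 shards per GB of heap.
SHARDS_PER_GB_HEAP = 20


def shard_count(days_of_daily_indices: int, primaries_per_index: int) -> int:
    """Primary shards produced by creating one index per day.

    Replicas are ignored: on a single node they cannot be assigned anyway.
    """
    return days_of_daily_indices * primaries_per_index


def max_recommended_shards(heap_gb: float) -> int:
    """Upper bound on shards for a node with the given heap, per the rule of thumb."""
    return int(heap_gb * SHARDS_PER_GB_HEAP)


def min_heap_gb_for(shards: int) -> float:
    """Smallest heap (GB) at which `shards` stays within the rule of thumb."""
    return shards / SHARDS_PER_GB_HEAP


if __name__ == "__main__":
    # Illustrative figures: 1 GB heap, 160 daily indices with the old default of 5 primaries.
    shards = shard_count(160, 5)
    print(shards, max_recommended_shards(1), min_heap_gb_for(shards))  # 800 20 40.0
```

A 40 GB heap would be above the roughly 32 GB compressed-oops threshold, so growing the heap alone is not a realistic fix. Reducing the shard count is more effective: with one primary per daily index, the same 160 days would need only `min_heap_gb_for(160) == 8.0` GB.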

The log you posted mentions quite long garbage collections (GC), which indicate that the heap is not large enough for the number of open shards.
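The `JvmGcMonitorService` warning at the end of the log reports how much wall-clock time went into GC. The sketch below pulls out that ratio, assuming the line format shown in the post and that durations are printed in `s` or `ms` (other units, such as minutes, are not handled). For the posted line, 1.5 s out of 1.6 s works out to roughly 94% of the time spent collecting garbage.

```python
import re
from typing import Optional

# Matches the overhead warning seen in the post, e.g.
# "[gc][58] overhead, spent [1.5s] collecting in the last [1.6s]"
# Assumption: durations are logged in "s" or "ms" only.
GC_OVERHEAD = re.compile(
    r"\[gc\]\[(\d+)\] overhead, spent \[([\d.]+)(ms|s)\] "
    r"collecting in the last \[([\d.]+)(ms|s)\]"
)


def _to_seconds(value: str, unit: str) -> float:
    """Convert a logged duration to seconds."""
    return float(value) / 1000 if unit == "ms" else float(value)


def gc_overhead_fraction(line: str) -> Optional[float]:
    """Fraction of the reporting window spent in GC, or None if the line doesn't match."""
    m = GC_OVERHEAD.search(line)
    if m is None:
        return None
    spent = _to_seconds(m.group(2), m.group(3))
    window = _to_seconds(m.group(4), m.group(5))
    return spent / window


if __name__ == "__main__":
    line = ("[2019-09-20T10:00:44,393][WARN ][o.e.m.j.JvmGcMonitorService] [h_iLag0] "
            "[gc][58] overhead, spent [1.5s] collecting in the last [1.6s]")
    print(f"{gc_overhead_fraction(line):.0%} of the time spent in GC")
```

A node that spends most of each window collecting has very little time left to answer searches, which matches the Kibana time-outs.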

I would suggest looking at how you manage your indices; index lifecycle management (ILM) can help automate this.
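As a minimal sketch of what ILM could look like here, the following builds a policy body that simply deletes old indices. The policy name `firewall-logs-retention` and the 30-day retention are hypothetical choices. The body would be sent with `PUT _ilm/policy/<name>` (ILM is available in 6.8). When an index is not rolled over, ILM measures `min_age` from the index creation date, which fits daily `logstash-*` indices.

```python
import json


def delete_after_policy(retention_days: int) -> dict:
    """ILM policy body that deletes an index once it is `retention_days` old.

    Without rollover, min_age is measured from the index creation date.
    """
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical name and retention; send as: PUT _ilm/policy/firewall-logs-retention
    print(json.dumps(delete_after_policy(30), indent=2))
```

New daily indices pick up the policy if the index template sets `"index.lifecycle.name": "firewall-logs-retention"`. Existing indices can be given the same setting through the `_settings` API.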

If it has been running fine until now, then the hardware resources are probably OK for your usage and you just need to optimise.
If this data is production-critical, as you say, I highly recommend using snapshots to keep a backup copy. The risk of failure with just one node is far too high.
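Below is a sketch of the two requests involved in taking snapshots to a shared-filesystem repository. The repository name `nightly_backup` and the path `/mnt/es-backups` are hypothetical. The location must also be listed under `path.repo` in `elasticsearch.yml`, and it should ideally live on a different disk or machine than the data. Snapshot names must be lowercase.

```python
import json
from datetime import date


def fs_repository_request(repo: str, location: str):
    """(method, path, body) to register a shared-filesystem snapshot repository.

    `location` must also appear under path.repo in elasticsearch.yml.
    """
    return ("PUT", f"/_snapshot/{repo}", {"type": "fs", "settings": {"location": location}})


def snapshot_request(repo: str, day: date, indices: str = "logstash-*"):
    """(method, path, body) to snapshot the matching indices under a dated, lowercase name."""
    name = f"snapshot-{day:%Y.%m.%d}"
    return ("PUT", f"/_snapshot/{repo}/{name}?wait_for_completion=true", {"indices": indices})


if __name__ == "__main__":
    # Hypothetical repository name and mount point.
    for method, path, body in (
        fs_repository_request("nightly_backup", "/mnt/es-backups"),
        snapshot_request("nightly_backup", date(2019, 9, 24)),
    ):
        print(method, path)
        print(json.dumps(body))
```

Snapshots are incremental, so running the second request daily (for example from cron) only copies segments that have not already been stored in the repository.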

Hope this helps and have a great weekend!

Ian_Miller · September 24, 2019, 2:33pm

That's really helpful. In my naivety I'd left the heap size in jvm.options at 1GB, and with over 800 (admittedly small) shards it's no wonder it was struggling. I've increased the heap and changed the Logstash index template to use only one shard for each new daily index. Hopefully this will keep things stable until I build out some more nodes.
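The two changes described above can be sketched as follows. The template body would be stored with `PUT _template/<name>` (the name is up to you). Its `order` of 1 assumes that Logstash's own template uses a lower order, so this template's settings win. The template only affects indices created after it is stored. For the heap, the sketch follows Elastic's usual guidance: set `-Xms` and `-Xmx` to the same value, use no more than half the RAM, and stay below the compressed-oops threshold, for which 26 GB is safe on most systems.

```python
def one_shard_template(pattern: str = "logstash-*", order: int = 1) -> dict:
    """Index template body giving each new matching index a single primary shard.

    Existing indices keep their shard count; only newly created ones are affected.
    """
    return {
        "index_patterns": [pattern],
        "order": order,  # assumed higher than the order of Logstash's own template
        "settings": {"number_of_shards": 1},
    }


def heap_options(ram_gb: int) -> list:
    """jvm.options lines with equal -Xms/-Xmx, at most half the RAM, capped at 26 GB."""
    heap = max(1, min(ram_gb // 2, 26))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


if __name__ == "__main__":
    print(one_shard_template())
    print(heap_options(8))  # ['-Xms4g', '-Xmx4g']
```

On a node that also runs Logstash and Kibana, leave extra headroom beyond the half-of-RAM figure, since those processes need memory too.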

system · October 22, 2019, 2:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
