Single node troubleshooting help required


We've had a single node running logstash, elasticsearch and kibana for about six months. Although it wasn't intended to be a production system, it's become essential for troubleshooting - we use it to capture firewall logs via syslog.

It was working fine, but at some point in the last few weeks Kibana began to just time out (or sometimes return a 500 error).

In /var/log/elasticsearch//elasticsearch.log I'm seeing stuff like this:

[2019-09-20T10:00:42,499][DEBUG][o.e.a.s.TransportSearchAction] [h_iLag0] All shards failed for phase: [query]
[2019-09-20T10:00:42,500][WARN ][r.suppressed ] [h_iLag0] path: /.kibana_task_manager/_doc/_search, params: {ignore_unavailable=true, index=.kibana_task_manager, type=_doc} all shards failed
at ~[elasticsearch-6.8.2.jar:6.8.2]
at ~[elasticsearch-6.8.2.jar:6.8.2]
at ~[elasticsearch-6.8.2.jar:6.8.2]
at ~[elasticsearch-6.8.2.jar:6.8.2]
at$performPhaseOnShard$1( ~[elasticsearch-6.8.2.jar:6.8.2]
at$1.doRun( [elasticsearch-6.8.2.jar:6.8.2]
at [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun( [elasticsearch-6.8.2.jar:6.8.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun( [elasticsearch-6.8.2.jar:6.8.2]
at [elasticsearch-6.8.2.jar:6.8.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_222]
at [?:1.8.0_222]
[2019-09-20T10:00:44,393][WARN ][o.e.m.j.JvmGcMonitorService] [h_iLag0] [gc][58] overhead, spent [1.5s] collecting in the last [1.6s]

I've no idea where to start trying to understand what this is about - can anyone supply some pointers? I've tried simple things like checking for disk space, system load and rebooting.

We're running 6.8.2 on Debian.

Hi Ian and welcome to the forum!

You probably have too many shards in the cluster by now, especially if you use the default of 5 primaries per index that older versions had set. There should ideally be fewer than 20 shards per GB of JVM heap configured. Running all components on the same single node is also not recommended, as this leads to resource contention: off-heap memory, for example, cannot be used as expected, and our recommendations assume a node dedicated to Elasticsearch.
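To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. The function names and the 180-day figure are illustrative assumptions, not values taken from the cluster in question; only the "20 shards per GB of heap" guideline and the 5-primaries default come from this reply.

```python
# Rule of thumb quoted above: fewer than 20 shards per GB of JVM heap.
SHARDS_PER_GB_HEAP = 20


def shard_budget(heap_gb: float) -> int:
    """Approximate number of shards a heap of this size can comfortably carry."""
    return int(heap_gb * SHARDS_PER_GB_HEAP)


def shards_from_daily_indices(days: int, primaries_per_index: int) -> int:
    """Primary shards accumulated by one daily index per day, never deleted."""
    return days * primaries_per_index


# Illustrative numbers only: roughly six months of daily indices.
accumulated = shards_from_daily_indices(days=180, primaries_per_index=5)
budget = shard_budget(heap_gb=1)
print(accumulated, budget)  # 900 20
```

With one primary per daily index instead of five, the same period would accumulate 180 shards, which is why reducing the shard count per index matters as much as raising the heap.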

The log you posted mentions quite long garbage collections (gc), which indicate that the heap is not sufficient for the number of open shards.

I would suggest you look at managing the indices; index lifecycle management (ILM) can help automate this.
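As a rough illustration of what such a policy could look like, the sketch below builds the JSON body for an ILM policy that deletes indices after a retention period. The helper name and the "30d" retention are hypothetical placeholders; the body is meant to be sent with `PUT _ilm/policy/<name>`, assuming a distribution of Elasticsearch that includes ILM.

```python
import json


def delete_after_policy(retention: str) -> dict:
    """Body for PUT _ilm/policy/<name>: delete an index once it is older than `retention`."""
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": retention,
                    "actions": {"delete": {}},
                }
            }
        }
    }


# "30d" is a placeholder retention, not a recommendation.
body = delete_after_policy("30d")
print(json.dumps(body, indent=2))
```

A policy only applies to indices whose `index.lifecycle.name` setting names it, so it would normally be referenced from the index template used for the daily indices.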

If it has been running fine until now, then the hardware resources are probably OK for your usage and you would just need to optimise.
If this data is production critical, as you say, I highly recommend using snapshots to keep a backup copy. The risk of failure on just one node is far too high.

Hope this helps and have a great weekend!


That's really helpful. In my naivety I'd left the heap size configured in jvm.options at 1GB, and with over 800 (admittedly small) shards it's no wonder it was struggling. I've increased the heap and changed the logstash index template to only use one shard for each new daily index. Hopefully this will allow stuff to remain stable until I build out some more nodes.
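For anyone making the same two changes, here is a minimal sketch of what they might look like. The exact template and heap size used above are not stated, so the `logstash-*` pattern, the `order` value and the 4 GB figure are assumptions for illustration. The template body targets the legacy `PUT _template/<name>` API available in 6.x, and the flags are the standard JVM heap options set in jvm.options.

```python
import json


def single_shard_template(pattern: str, order: int = 1) -> dict:
    """Body for PUT _template/<name>: one primary shard for each new matching index."""
    return {
        "index_patterns": [pattern],
        # Assumed to be higher than the order of any existing template it should override.
        "order": order,
        "settings": {"index": {"number_of_shards": 1}},
    }


def heap_flags(heap_gb: int) -> list:
    """jvm.options lines setting minimum and maximum heap to the same size."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


template = single_shard_template("logstash-*")
print(json.dumps(template, indent=2))
print("\n".join(heap_flags(4)))  # 4 GB is an illustrative value only
```

A template only affects indices created after it is in place; the existing daily indices keep their original shard count until they are deleted or reindexed.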
