Hi, I have an ELK stack running on Docker Swarm. There are 3 VMs with 16GB of memory each, and each one runs an Elasticsearch instance (version 6.2.3) with 10GB of heap space. The problem is that the cluster is unstable. I have turned off swap and set up the ulimits, but still, every couple of hours it looks like it hangs: intake from Logstash stops and all indexes go into a red state. The cluster usually recovers slowly on its own, but sometimes it needs to be restarted. Unfortunately I have not been able to find a root cause.
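For context, the swap and ulimit changes are applied on the swarm hosts, and the heap is set through the service environment. Roughly, it looks like this (a simplified sketch of my setup, not the exact files):

    # on each swarm host
    sudo swapoff -a                           # swap disabled
    sudo sysctl -w vm.max_map_count=262144    # mmap limit Elasticsearch requires
    # memlock/nofile limits are raised via the Docker daemon's default ulimits

    # relevant environment on each elasticsearch service in the stack file
    ES_JAVA_OPTS=-Xms10g -Xmx10g
    bootstrap.memory_lock=true

When it hangs I keep getting errors like this: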
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana][doc][config:6.2.3]: routing [null]]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:209) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:186) ~[elasticsearch-6.2.3.jar:6.2.3]
or
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 | org.elasticsearch.transport.SendRequestTransportException: [elasticsearch1][10.0.0.27:9300][indices:admin/seq_no/global_checkpoint_sync[p]]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 |     at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:608) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 |     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 |     at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:506) ~[elasticsearch-6.2.3.jar:6.2.3]
or
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:132) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:107) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 |     at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:205) ~[elasticsearch-6.2.3.jar:6.2.3]
I also get tons of warnings like this:
[2018-05-28T15:25:07,249][WARN ][o.e.g.DanglingIndicesState] [elasticsearch3] [[filebeat-6.2.4-2018.05.14/pg9k4XIRR0-GOK5GferVDg]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
(I have been told the dangling index warning is not a problem, but I want to include it for completeness.)
I have adjusted memory settings and searched high and low without finding a root cause, or anything that points me toward one. All of these exceptions look like symptoms rather than the root cause to me. If you have anything that can point me to what might be going on, I would really appreciate it.
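For reference, this is roughly how I check the cluster when it hangs (standard Elasticsearch 6.x APIs; the host and port are just what my setup exposes):

    # overall cluster status (goes red during the hangs)
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # which shards are unassigned and why
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
    # per-node heap and GC statistics
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'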