ElasticSearch Unstable

Hi, I have an ELK stack setup running on Docker Swarm. I have 3 VMs with 16GB of memory each, and each one runs an ES instance with 10GB of heap space (version 6.2.3). The problem is that it is unstable. I have turned off swap and set up the ulimits, and still, every couple of hours it looks like it hangs: intake from Logstash stops and all indices go into a red state. The cluster will slowly recover (usually, but sometimes it needs to be restarted). Unfortunately I have not been able to find a root cause. I keep getting errors like this:
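For reference, the swap/ulimit side is handled roughly like this — a sketch only, with illustrative names rather than my exact stack file; the memlock and nofile limits are raised on the Docker hosts, and each ES service gets its heap and memory lock via the environment:

version: "3.4"
services:
  elasticsearch1:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.2.3
    environment:
      - ES_JAVA_OPTS=-Xms10g -Xmx10g     # 10GB heap on a 16GB VM
      - bootstrap.memory_lock=true       # lock the JVM heap so it cannot be swapped out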

central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana][doc][config:6.2.3]: routing [null]]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:209) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:186) ~[elasticsearch-6.2.3.jar:6.2.3]

or

central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 | org.elasticsearch.transport.SendRequestTransportException: [elasticsearch1][10.0.0.27:9300][indices:admin/seq_no/global_checkpoint_sync[p]]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 | at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:608) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 | at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch1.1.pcf8ifg53ulg@logger-swarm2 | at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:506) ~[elasticsearch-6.2.3.jar:6.2.3]

or

central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:274) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:132) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:243) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:107) ~[elasticsearch-6.2.3.jar:6.2.3]
central_logger_elasticsearch3.1.apwko5p7kyjl@logger-swarm3 | at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:205) ~[elasticsearch-6.2.3.jar:6.2.3]

Also, I get tons of warnings like this:

[2018-05-28T15:25:07,249][WARN ][o.e.g.DanglingIndicesState] [elasticsearch3] [[filebeat-6.2.4-2018.05.14/pg9k4XIRR0-GOK5GferVDg]] can not be imported as a dangling index, as index with same name already exists in cluster metadata

(I have been told the dangling index is not a problem, but I want to include it for completeness.)

I have adjusted memory and searched high and low without finding a root cause or anything to point me toward one. All these exceptions look like symptoms rather than the root cause to me. So if you have anything that can point me to what might be going on, I would really appreciate it.

These messages also sound more like symptoms than causes to me. Can you provide full logs from a period containing an outage, ideally from all three nodes? They might be quite large but should compress well.

I am posting the logs here. There are 2 different logs from ES3 because I am using the docker logs command and it froze on me, so I grabbed the output in 2 different ways. The logs are here: https://www.dropbox.com/s/b12yujq10w1qhuq/es_logs.zip?dl=0

That's great. The fundamental problem is the inability to elect a stable master. I used this to grab relevant lines:

grep es2.log -e ' \[.*\(ZenDiscovery\|MasterService\)'
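The other node logs can be scanned the same way, for example (assuming the files are named es1.log, es2.log and es3.log):

for f in es1.log es2.log es3.log; do
  echo "== $f"
  grep "$f" -e ' \[.*\(ZenDiscovery\|MasterService\)'
done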

First, elasticsearch2 is elected master by elasticsearch1 (NB not elasticsearch3):

central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:00:10,308][INFO ][o.e.c.s.MasterService    ] [elasticsearch2] zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch1}{UohioltNQHyznjH4uwq0vg}{q6G5FZR1RSqjGb2uT1wVMw}{10.0.0.27}{10.0.0.27:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}], reason: new_master {elasticsearch2}{HVP-WhxhQjyRMLAgEFr9FQ}{pnnXMWK1SH-tIxYOFOEnsw}{10.0.0.13}{10.0.0.13:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}

It publishes a new cluster state to establish its leadership, but this times out after 30 seconds:

central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:00:40,309][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch2] zen-disco-failed-to-publish, current nodes: nodes: 
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:00:40,310][WARN ][o.e.c.s.MasterService    ] [elasticsearch2] failing [zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch1}{UohioltNQHyznjH4uwq0vg}{q6G5FZR1RSqjGb2uT1wVMw}{10.0.0.27}{10.0.0.27:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}]]: failed to commit cluster state version [3596]

It starts another election round and wins with a vote from elasticsearch3 (NB not elasticsearch1):

central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:00:43,368][INFO ][o.e.c.s.MasterService    ] [elasticsearch2] zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch3}{gEtAcxDgQbqNmWXO_J2uWA}{rjbvRX1jRfGiayVDKO2Ppw}{10.0.0.25}{10.0.0.25:9300}{ml.machine_memory=16826437632, ml.max_open_jobs=20, ml.enabled=true}], reason: new_master {elasticsearch2}{HVP-WhxhQjyRMLAgEFr9FQ}{pnnXMWK1SH-tIxYOFOEnsw}{10.0.0.13}{10.0.0.13:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}

It publishes a new cluster state to establish its leadership, but this too times out after 30 seconds:

central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:13,369][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch2] zen-disco-failed-to-publish, current nodes: nodes: 
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:13,369][WARN ][o.e.c.s.MasterService    ] [elasticsearch2] failing [zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch3}{gEtAcxDgQbqNmWXO_J2uWA}{rjbvRX1jRfGiayVDKO2Ppw}{10.0.0.25}{10.0.0.25:9300}{ml.machine_memory=16826437632, ml.max_open_jobs=20, ml.enabled=true}]]: failed to commit cluster state version [3596]

The cycle repeats:

central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:16,419][INFO ][o.e.c.s.MasterService    ] [elasticsearch2] zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch1}{UohioltNQHyznjH4uwq0vg}{q6G5FZR1RSqjGb2uT1wVMw}{10.0.0.27}{10.0.0.27:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}], reason: new_master {elasticsearch2}{HVP-WhxhQjyRMLAgEFr9FQ}{pnnXMWK1SH-tIxYOFOEnsw}{10.0.0.13}{10.0.0.13:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:46,420][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch2] zen-disco-failed-to-publish, current nodes: nodes: 
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:46,420][WARN ][o.e.c.s.MasterService    ] [elasticsearch2] failing [zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch1}{UohioltNQHyznjH4uwq0vg}{q6G5FZR1RSqjGb2uT1wVMw}{10.0.0.27}{10.0.0.27:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}]]: failed to commit cluster state version [3596]
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:01:49,469][INFO ][o.e.c.s.MasterService    ] [elasticsearch2] zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch3}{gEtAcxDgQbqNmWXO_J2uWA}{rjbvRX1jRfGiayVDKO2Ppw}{10.0.0.25}{10.0.0.25:9300}{ml.machine_memory=16826437632, ml.max_open_jobs=20, ml.enabled=true}], reason: new_master {elasticsearch2}{HVP-WhxhQjyRMLAgEFr9FQ}{pnnXMWK1SH-tIxYOFOEnsw}{10.0.0.13}{10.0.0.13:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:02:19,470][WARN ][o.e.d.z.ZenDiscovery     ] [elasticsearch2] zen-disco-failed-to-publish, current nodes: nodes: 
central_logger_elasticsearch2.1.jeas3pj8lu5a@logger-swarm1    | [2018-05-29T00:02:19,470][WARN ][o.e.c.s.MasterService    ] [elasticsearch2] failing [zen-disco-elected-as-master ([1] nodes joined)[{elasticsearch3}{gEtAcxDgQbqNmWXO_J2uWA}{rjbvRX1jRfGiayVDKO2Ppw}{10.0.0.25}{10.0.0.25:9300}{ml.machine_memory=16826437632, ml.max_open_jobs=20, ml.enabled=true}]]: failed to commit cluster state version [3596]

The fact that not all nodes vote in each election, and that the cluster state publication is timing out, suggests that the nodes are too busy with search/indexing traffic to also act as master nodes, and that dedicated master nodes are the recommended way forwards here:

It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

It is also possible that a contributing factor is that you have too many shards: I see a number of indices being created with the default 5 shards (filebeat, metricbeat, logstash, auditbeat, packetbeat) which could be more than you need. The default in 7.0.0 will be 1 shard.
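For instance, an index template along these lines would give new Beats/Logstash indices 1 shard and 1 replica. The template name, patterns and order here are illustrative, and note that the templates shipped with Beats/Logstash also set shard counts, so you may need to adjust the order/priority or the shipped templates themselves:

PUT _template/one_shard_default
{
  "index_patterns": ["filebeat-*", "metricbeat-*", "logstash-*", "auditbeat-*", "packetbeat-*"],
  "order": 100,
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}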

Thanks, David, that is great to know. Based on what you told me, I plan to:

  1. set the default to 1 shard per index with 1 replica of that shard.
  2. set my 3 existing data nodes to node.master: false (the matching data-node settings are sketched after the config below).
  3. start 3 instances of ES that have no data and are just the master nodes, i.e.:

node.master: true
node.data: false
node.ingest: false
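
And for step 2, the existing data nodes would just flip the master flag — something like:

node.master: false
node.data: true
node.ingest: true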

One last question though. On Docker Swarm I would like to have 1 service for the ES master nodes and replicate it multiple times (likely 3 times). But when I do this, I get an error in the cluster that multiple nodes with the same name are trying to join. Is there a way around this?

Once again thanks so much for the help.

Sounds like a good plan, as long as you are ok with the cluster not being available between steps 2 and 3. If you wish to keep it available you'll need to set node.master: false on one data node, then bring in one dedicated master, and then repeat those two steps two more times.
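A simple way to confirm the cluster has settled before each repeat is to poll cluster health (assuming you can reach one of the nodes on port 9200 from wherever you run this):

curl -s 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s&pretty'
# check that "status" is at least yellow and "number_of_nodes" matches what you expect
# before moving on to the next node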

Afraid I don't know enough about Docker Swarm to answer your other question for definite, but I think you can omit the node name from the configuration file and Elasticsearch will invent a random name the first time the node starts.
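For example, something like this in elasticsearch.yml (just a sketch; I haven't tried it under Swarm myself):

# node.name: elasticsearch1    # omit node.name entirely and a random name is generated, or:
node.name: ${HOSTNAME}         # substitute an environment variable that is unique per container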

I changed the setup as described, and I even removed the ES data directory so that it would start from scratch and reduce noise in the logs. But I am still having the same problem. I have reattached the logs. I looked through them and it looks like it might still be the same problem as before. If you have a minute to look again I would really appreciate it.

Thanks,

I am not sure if this makes a big difference, but I am using the master nodes as coordinating nodes as well.

This now looks more like a connectivity problem. The elected master seems stable, but the data nodes are not maintaining a stable connection to it. I suspect something funky related to Docker networking. Specifically, I see some indication that elasticsearch1 thinks its address is 10.0.0.81:

[elasticsearch1] send message failed [channel: NettyTcpChannel{localAddress=/10.0.0.81:9300, remoteAddress=/10.0.0.80:53212}]

but there's other evidence that it thinks it's 10.0.0.64:

[_CWly_t] zen-disco-node-join[{elasticsearch3}{XC16YvsrQaKwiMceSu8KVw}{o-0y8TNNRv6n3HeIvpiifg}{10.0.0.66}{10.0.0.66:9300}{ml.machine_memory=16826437632, ml.max_open_jobs=20, ml.enabled=true}, {elasticsearch2}{T4k7vNPvQeSgbmt7bgQp6w}{G--hwxpSQouFlOM-dNO-QQ}{10.0.0.68}{10.0.0.68:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}, {elasticsearch1}{gILStk8GQFO9HYXkqql7mQ}{q6wVSbG3Qw-QziDThrkurQ}{10.0.0.64}{10.0.0.64:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true}], reason: added {{elasticsearch2}{T4k7vNPvQeSgbmt7bgQp6w}{G--hwxpSQouFlOM-dNO-QQ}{10.0.0.68}{10.0.0.68:9300}{ml.machine_memory=16826454016, ml.max_open_jobs=20, ml.enabled=true},}

I also see messages like No route to host that indicate something is going badly at the networking layer. Can you dig into that further?
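One thing that sometimes helps in setups with multiple container networks is to pin each node's publish address explicitly, so that it only ever advertises a single reachable address — roughly this in each node's elasticsearch.yml (addresses here are placeholders):

network.host: 0.0.0.0
network.publish_host: 10.0.0.64   # the one address the other nodes should use to reach this node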

This isn't recommended. Quoting the docs:

While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.
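For what it's worth, a coordinating-only node is simply one with all three roles switched off:

node.master: false
node.data: false
node.ingest: false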

Good news: I did some readjusting and the cluster has been stable for about 12 hours (so far that is a record for this cluster). What I ended up doing is:

  1. removing one data node to free up memory (so 2 data nodes)
  2. increasing the memory of the 3 master nodes to 1.5G each
  3. creating a coordinating node with 4G of memory (all incoming traffic goes here; heap settings sketched below)
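
For reference, the heap sizes are just set through the JVM options (via ES_JAVA_OPTS here, though jvm.options would work equally well; the values are the ones listed above):

# master nodes
ES_JAVA_OPTS=-Xms1500m -Xmx1500m
# coordinating node
ES_JAVA_OPTS=-Xms4g -Xmx4g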

So far it has been stable with pretty good performance. I do have 1 question though. I noticed I can hit every node in the cluster at http://<node>:9200 and get back the standard "You Know, for Search" response. However, if I hit each node in the cluster at http://<node>:9200/_cat/nodes, all the nodes but the master nodes respond correctly. The master nodes time out. Even though the master nodes time out, the cluster is still up and functional. So I am wondering: is this normal behavior for master nodes, or is it showing a problem that is about to happen?

Thanks,

No, this is not normal, and the first thing I'd look for here is still a connectivity issue. GET /_cat/nodes is coordinated by the node that receives the request, and does three things:

  1. get the cluster state from the current master.
  2. reach out to every node in the cluster and get its NodeInfo.
  3. reach out again to every node in the cluster and get its NodeStats.

The master nodes shouldn't have trouble with the first step, so a timeout here suggests that they're unable to talk to one of the other nodes. Elasticsearch's intra-cluster connections are asymmetric so it is possible that the data nodes can request data from the master and not vice versa. That said, I'm a little surprised this is "stable": I'd expect nodes to be repeatedly joining and leaving the cluster and the logs to be quite noisy if this is the case.
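If you want to rule the network in or out, it may be worth checking reachability in both directions on both ports — for example, from a master node towards a data node and then the reverse (addresses are placeholders, and nc may not be available inside the containers):

curl -s http://10.0.0.64:9200/    # HTTP port
nc -vz 10.0.0.64 9300             # transport port, which the nodes use to talk to each other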

OK, I just pulled my ES nodes out of Docker and installed them directly on the host OS, and they seem to be working better. However, I now see these warnings:

[2018-06-05T21:48:36,363][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][10990] overhead, spent [277ms] collecting in the last [1s]
[2018-06-05T21:49:34,616][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][11048] overhead, spent [265ms] collecting in the last [1s]
[2018-06-05T21:52:20,296][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][11213] overhead, spent [250ms] collecting in the last [1s]
[2018-06-05T21:55:46,151][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][11418] overhead, spent [277ms] collecting in the last [1s]
[2018-06-05T21:58:49,471][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][11601] overhead, spent [265ms] collecting in the last [1s]
[2018-06-05T22:04:05,428][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][11916] overhead, spent [288ms] collecting in the last [1s]
[2018-06-05T22:09:02,627][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][12212] overhead, spent [289ms] collecting in the last [1s]
[2018-06-05T22:14:41,941][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][12550] overhead, spent [260ms] collecting in the last [1s]
[2018-06-05T22:19:12,863][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][12820] overhead, spent [272ms] collecting in the last [1s]
[2018-06-05T22:21:07,204][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch3] [gc][12934] overhead, spent [262ms] collecting in the last [1s]

The good news is that it seems to be working correctly; I am just not sure if it will keep working with these types of warnings. I have included some screenshots of cluster stats to help with any ideas of what I might want to change.

These messages are normal for a node that's under load. We output them to help to diagnose issues related to memory pressure, but I don't think there's any indication of that from what I see here.
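If you want to keep an eye on it, per-node heap usage is easy to watch, e.g.:

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
# a heap.percent that sits persistently near the top of the range would be the thing to watch for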
