Cluster broken after 7.2 -> 7.4 upgrade

Hi

I upgraded my cluster (3 master nodes) from 7.2 to 7.4 last night after which it completely fell apart.

It looks like there is some communication issue between the nodes in the cluster and they keep on trying to elect a master. Some requests time out, the master election fails and they restart the master election.

Here's the log from a node which was a master that failed

Thanks for picking the right TRACE loggers to help us here. It looks to me like it is running unbelivably slowly when writing the cluster metadata.

...
[2019-11-01T12:59:27,808][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.triggered_watches/FO5I-Lm4SBK0ORPzqGcCZw]] writing state, reason [freshly created]
[2019-11-01T13:00:21,854][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.triggered_watches/FO5I-Lm4SBK0ORPzqGcCZw]] state written
[2019-11-01T13:00:21,854][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.monitoring-kibana-7-2019.10.29/kvA2TF6qRcavKFSCxDpx7g]] writing state, reason [freshly created]
[2019-11-01T13:01:26,265][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.monitoring-kibana-7-2019.10.29/kvA2TF6qRcavKFSCxDpx7g]] state written
[2019-11-01T13:01:43,634][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[netflow-1.0.0-000012/oOtLDL0aQXG4spHE9TjP3Q]] writing state, reason [freshly created]
[2019-11-01T13:02:22,438][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[netflow-1.0.0-000012/oOtLDL0aQXG4spHE9TjP3Q]] state written
[2019-11-01T13:03:08,802][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.tasks/8RwdojhIRXukEf2n6ulDYA]] writing state, reason [freshly created]
[2019-11-01T13:03:18,200][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.tasks/8RwdojhIRXukEf2n6ulDYA]] state written
[2019-11-01T13:03:39,575][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.kibana_1/kNIGRCsmSu6LUA3xAch2gw]] writing state, reason [freshly created]
[2019-11-01T13:03:40,138][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.kibana_1/kNIGRCsmSu6LUA3xAch2gw]] state written
...

That's over 4 minutes to write just five metadata files. These metadata files are typically very small and yet we see here some of them taking well over a minute to write. It also looks like there are big pauses in between writing each file which is very strange since these files are written in a fairly tight loop. There's no mention of GC in this log. Is this node using swap perhaps? I can't think of any other reason why it would be running so desperately slowly.

Thanks for the quick reply!

ES heap is set to 8 GB and these are the nodes memory stats:

# free -h
                     total        used        free      shared  buff/cache   available
Mem:            14G        9.2G        160M        8.4M        5.2G        4.7G
Swap:          3.9G        5.9M        3.9G
# sysctl vm.swappiness
vm.swappiness = 1

I just checked IO usage using iostat and there seems to be a high amount of iowait going on on all nodes (over 25%) but I wouldn't think that was enough to cause such significant slowdowns.

I performed a random and sequential read/write test using dd and the speed wasn't bad enough to cause these problems. However the cluster is running on windows storage spaces which might be causing some problems behind the scenes (mix of ssds and hdds).

# dd if=/dev/zero of=/opt/elasticsearch-data/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 3.26445 s, 329 MB/s
# dd if=/dev/zero of=/opt/elasticsearch-data/test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 5.75844 s, 88.9 kB/s

I also tested the network speed between the nodes using iperf and all nodes show speed between 2-4 Gbits/sec

I don't think these tests accurately reflect Elasticsearch's IO behaviour. dd only does sequential writes and does not AFAIK attempt to sync after each write, nor does it do any renaming or metadata syncs.

I misread that as MB/s not kB/s. 88.9 kB/s is definitely not the kind of IO performance that Elasticsearch expects.

I recommend using a better IO benchmarking tool than dd. For instance fio is a popular choice. To get something closer to the pattern of writes for metadata persistence you should write (and fsync) many small files.

Sequential:

fio-3.7
Starting 8 processes
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
seqread: Laying out IO file (1 file / 1024MiB)
Jobs: 4 (f=4): [_(1),R(1),_(2),R(3),_(1)][100.0%][r=78.4MiB/s,w=0KiB/s][r=10.0k,w=0 IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=8): err= 0: pid=56088: Fri Nov  1 18:12:22 2019
   read: IOPS=9582, BW=74.9MiB/s (78.5MB/s)(8192MiB/109431msec)
    slat (usec): min=10, max=7126, avg=20.58, stdev=39.29
    clat (usec): min=2, max=277895, avg=803.10, stdev=748.74
     lat (usec): min=158, max=277911, avg=825.03, stdev=749.66
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  404], 10.00th=[  461], 20.00th=[  537],
     | 30.00th=[  594], 40.00th=[  652], 50.00th=[  701], 60.00th=[  750],
     | 70.00th=[  840], 80.00th=[  955], 90.00th=[ 1188], 95.00th=[ 1467],
     | 99.00th=[ 2573], 99.50th=[ 3523], 99.90th=[ 7046], 99.95th=[ 8455],
     | 99.99th=[10290]
   bw (  KiB/s): min= 4023, max=14025, per=12.54%, avg=9610.66, stdev=823.31, samples=1740
   iops        : min=  502, max= 1753, avg=1201.26, stdev=102.92, samples=1740
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.11%, 500=14.77%, 750=44.57%, 1000=23.37%
  lat (msec)   : 2=15.26%, 4=1.53%, 10=0.36%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=1.15%, sys=4.38%, ctx=1048406, majf=0, minf=284
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=74.9MiB/s (78.5MB/s), 74.9MiB/s-74.9MiB/s (78.5MB/s-78.5MB/s), io=8192MiB (8590MB), run=109431-109431msec

Disk stats (read/write):
    dm-3: ios=1048319/11, merge=0/0, ticks=765314/33, in_queue=767783, util=99.84%, aggrios=1048576/43, aggrmerge=0/6, aggrticks=762573/183, aggrin_queue=760848, aggrutil=99.71%
  sda: ios=1048576/43, merge=0/6, ticks=762573/183, in_queue=760848, util=99.71%

Random:

fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=89.9MiB/s,w=0KiB/s][r=23.0k,w=0 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=56728: Fri Nov  1 18:16:02 2019
   read: IOPS=22.5k, BW=87.0MiB/s (92.2MB/s)(4096MiB/46564msec)
   bw (  KiB/s): min=65272, max=113904, per=99.94%, avg=90024.84, stdev=9483.05, samples=93
   iops        : min=16318, max=28476, avg=22506.14, stdev=2370.75, samples=93
  cpu          : usr=7.13%, sys=34.35%, ctx=83253, majf=0, minf=94
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=87.0MiB/s (92.2MB/s), 87.0MiB/s-87.0MiB/s (92.2MB/s-92.2MB/s), io=4096MiB (4295MB), run=46564-46564msec

Disk stats (read/write):
    dm-3: ios=1044729/5, merge=0/0, ticks=2564528/24, in_queue=2564806, util=99.59%, aggrios=1048577/27, aggrmerge=0/3, aggrticks=2571570/108, aggrin_queue=2571240, aggrutil=99.55%
  sda: ios=1048577/27, merge=0/3, ticks=2571570/108, in_queue=2571240, util=99.55%

The speed was no problem until i upgraded from 7.2 to 7.4 though. I mean it wasnt fast but it wasn't this slow. Aggregates over several hundreds of gigabytes took maybe 20 seconds which was acceptable enough.

I can't think of any changes between 7.2 and 7.4 that would explain this. Are you still seeing these log messages from the MetaStateService indicating that it's taking many seconds to write each metadata file? Are these nodes purely master-eligible nodes or are they doing other jobs too? I guess it might help to get a thread dump while it's writing to see what else it's doing.

Here are the logs from all the nodes when I started them again this morning. It looks to me like the meta state writes are only taking a couple of seconds at most. The slowdown observed in the previous log was probably caused by something else.

bean.log
mean.log
lean.log

What I've noticed is that when master fails I get a ton of these SSL handshake errors

[2019-11-04T09:34:04,115][WARN ][o.e.t.TcpTransport       ] [el-bauni-mean] exception caught on transport layer [Netty4TcpChannel{localAddress=/172.24.8.151:9300, remoteAddress=/172.24.8.152:39624}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

EDIT: The problem continues even if I disable xpack security and ssl so the SSL handshake isn't the problem.
EDIT2: Forgot to answer your other questions, the cluster is also a rabbitmq cluster which has very low traffic. I turned off other services that were running on the cluster as they depend on elasticsearch.

Ok, we see el-bauni-lean be elected as master here:

[2019-11-04T09:28:43,083][INFO ][o.e.c.s.MasterService    ] [el-bauni-lean] elected-as-master ([2] nodes joined)[{el-bauni-lean}{bYCS8DxfQeOo4tTM86MAXw}{gYOGHMf2QJi6PiuOURl6CA}{lean.snerpa.local}{172.24.8.152:9300}{dim}{xpack.installed=true} elect leader, {el-bauni-bean}{RPgnf3XBS1abbt7YMPz0HQ}{GqLbSXBkRrqVY-bgoiNYeg}{bean.snerpa.local}{172.24.8.150:9300}{dim}{xpack.installed=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 229, version: 14737, reason: master node changed {previous [], current [{el-bauni-lean}{bYCS8DxfQeOo4tTM86MAXw}{gYOGHMf2QJi6PiuOURl6CA}{lean.snerpa.local}{172.24.8.152:9300}{dim}{xpack.installed=true}]}, added {{el-bauni-bean}{RPgnf3XBS1abbt7YMPz0HQ}{GqLbSXBkRrqVY-bgoiNYeg}{bean.snerpa.local}{172.24.8.150:9300}{dim}{xpack.installed=true},}

It looks ok for a bit but then it times out publishing a cluster state update:

[2019-11-04T09:32:44,386][DEBUG][o.e.c.s.MasterService    ] [el-bauni-lean] publishing cluster state version [14744]
[2019-11-04T09:32:50,229][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.watches/0zCWWk5lRK6KXxv-bUThWw]] writing state, reason [version changed from [194] to [195]]
[2019-11-04T09:33:15,749][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [[.watches/0zCWWk5lRK6KXxv-bUThWw]] state written
[2019-11-04T09:33:15,749][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [_meta] writing state, reason [changed]
[2019-11-04T09:33:30,144][TRACE][o.e.g.MetaStateService   ] [el-bauni-lean] [_meta] state written (generation: 10147)
[2019-11-04T09:33:43,773][INFO ][o.e.c.s.ClusterApplierService] [el-bauni-lean] master node changed {previous [{el-bauni-lean}{bYCS8DxfQeOo4tTM86MAXw}{gYOGHMf2QJi6PiuOURl6CA}{lean.snerpa.local}{172.24.8.152:9300}{dim}{xpack.installed=true}], current []}, term: 229, version: 14743, reason: becoming candidate: Publication.onCompletion(false)
[2019-11-04T09:33:46,463][WARN ][o.e.c.s.MasterService    ] [el-bauni-lean] failing [shard-started StartedShardEntry{shardId [[.watches][0]], allocationId [5q1c6ne5RuS-MEYawB902w], primary term [67], message ...: failed to commit cluster state version [14744]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
  at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1429) ~[elasticsearch-7.4.2.jar:7.4.2]
  at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:60) ~[elasticsearch-7.4.2.jar:7.4.2]
...
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s

Again the reason for the failure is that it is writing metadata egregiously slowly: it took 40 seconds to write the two small metadata files involved in this cluster state update. This still very much looks like misbehaviour on the IO front: this should be closer to milliseconds than minutes.

I also see timeout messages like these which are suggestive of network issues:

[2019-11-04T09:32:37,278][WARN ][o.e.t.TransportService   ] [el-bauni-lean] Received response for a request that has timed out, sent [22822ms] ago, timed out [0ms] ago, action [cluster:monitor/nodes/stats[n]], node [{el-bauni-bean}{RPgnf3XBS1abbt7YMPz0HQ}{GqLbSXBkRrqVY-bgoiNYeg}{bean.snerpa.local}{172.24.8.150:9300}{dim}{xpack.installed=true}], id [485]
[2019-11-04T09:32:43,710][WARN ][o.e.t.TransportService   ] [el-bauni-lean] Received response for a request that has timed out, sent [14929ms] ago, timed out [0ms] ago, action [internal:coordination/fault_detection/follower_check], node [{el-bauni-mean}{T2AxMn8sQTmBwe3g-5GGow}{LcYFhs0SRoGeGzuuvwmNcA}{mean.snerpa.local}{172.24.8.151:9300}{dim}{xpack.installed=true}], id [491]
[2019-11-04T09:33:09,656][WARN ][o.e.t.TransportService   ] [el-bauni-mean] Received response for a request that has timed out, sent [20927ms] ago, timed out [10962ms] ago, action [internal:coordination/fault_detection/leader_check], node [{el-bauni-lean}{bYCS8DxfQeOo4tTM86MAXw}{gYOGHMf2QJi6PiuOURl6CA}{lean.snerpa.local}{172.24.8.152:9300}{dim}{xpack.installed=true}], id [51]
[2019-11-04T09:33:09,735][WARN ][o.e.t.TransportService   ] [el-bauni-lean] Received response for a request that has timed out, sent [32524ms] ago, timed out [17595ms] ago, action [internal:coordination/fault_detection/follower_check], node [{el-bauni-bean}{RPgnf3XBS1abbt7YMPz0HQ}{GqLbSXBkRrqVY-bgoiNYeg}{bean.snerpa.local}{172.24.8.150:9300}{dim}{xpack.installed=true}], id [490]
[2019-11-04T09:34:10,510][WARN ][o.e.t.TransportService   ] [el-bauni-lean] Received response for a request that has timed out, sent [5208ms] ago, timed out [2201ms] ago, action [internal:discovery/request_peers], node [{el-bauni-mean}{T2AxMn8sQTmBwe3g-5GGow}{LcYFhs0SRoGeGzuuvwmNcA}{mean.snerpa.local}{172.24.8.151:9300}{dim}{xpack.installed=true}], id [523]
[2019-11-04T09:34:30,966][WARN ][o.e.t.TransportService   ] [el-bauni-bean] Received response for a request that has timed out, sent [12618ms] ago, timed out [2601ms] ago, action [internal:coordination/fault_detection/follower_check], node [{el-bauni-mean}{T2AxMn8sQTmBwe3g-5GGow}{LcYFhs0SRoGeGzuuvwmNcA}{mean.snerpa.local}{172.24.8.151:9300}{dim}{xpack.installed=true}], id [266]

I think the TLS messages Insufficient buffer remaining for AEAD cipher fragment are also due to network issues. I'm not familiar with Windows Storage Spaces -- is this some kind of network-attached storage? Would network issues explain the super-slow IO that you're seeing?

The way the storage is set up is that it is spread across all the nodes in a cluster. So it can be spread across multiple VM hosts.
It could explain the slowdown but there are no apparent problems in the cluster. Every other VM is working fine and it reports no issues. As far as i can tell everything is working perfectly aside from elasticsearch.

Thanks, I read up some docs on Windows Storage Spaces too and I understand it better now. It's not an environment in which we run any tests. My main question now is whether you're sure you need it for this setup. Elasticsearch really prefers to work with local disks, and handles all the distribution and replication itself since data locality is so important for performance. Adding another redundancy and distribution layer underneath is often asking for trouble, particularly if it doesn't respect locality so well.

Can you try a setup based on local disks and see if the problem persists?

The plan is to move the cluster to a dedicated machine in the near future. I'm using dstat to monitor performance while trying to start the cluster and I'm seeing io wait times spike up to 96% and at a minimum of 22%.

Do you know if there's any way for me to delete indices from the cluster while it's offline? So I can free up storage without losing the whole configuration.

Unfortunately that's not possible - simply deleting files would leave the node in an inconsistent state and it wouldn't be able to start up. However while a node is not running you can move the whole data directory to a new location (as long as it's a perfect copy, permissions and all), adjust path.data and start the node up at its new location. Do you have space to do that?

I don't have the space at the moment but I might be able to make something happen in the next couple of days.

Just FYI in order to elminiate as much as possible on the OS side I set up the cluster using docker on each node. It has the exact same problems.

I will try to move the cluster to a dedicated machine in the next few days.

Thanks for all the help :slight_smile:

I was able to start up the cluster once I moved it to dedicated hardware with SSDs.Thanks for all the help @DavidTurner

1 Like