Hi,
I've tried to upgrade a 5.1.2 cluster to 5.2.0, but it never turns green: some shards remain unassigned, while others are stuck in INITIALIZING. I've waited quite a while, but recovery on 5.1.2 is much faster. Also, once the cluster is in this state, the machines do nothing (on 5.1.2 an initializing node eats CPU and disk I/O), and I can't even query _cat/shards: the request never returns (_cluster/health does).
It doesn't reach a stable state even when I wait four times as long as 5.1.2 needs to go green after a full restart.
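For reference, this is roughly how I watch the recovery while the cluster is in that state (a minimal sketch; the endpoint is a placeholder for one of my HTTP nodes):

import requests

ES = "http://localhost:9200"  # placeholder HTTP endpoint of one node

# _cluster/health still answers in the stuck state...
health = requests.get(ES + "/_cluster/health", timeout=10).json()
print(health["status"], health["initializing_shards"], health["unassigned_shards"])

# ...while _cat/shards hangs, so a client-side timeout is needed.
try:
    print(requests.get(ES + "/_cat/shards?v", timeout=10).text)
except requests.exceptions.Timeout:
    print("_cat/shards did not return within 10s")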
The log is full of timeouts like this:
[2017-02-01T22:53:29,248][WARN ][o.e.c.NodeConnectionsService] [fe01] failed to connect to node {fe13}{UoA6Otl6RMKcdnmSZqw7ew}{1V5CtapBRAKEk3i2f4ZbBQ}{10.6.145.206}{10.6.145.206:9301} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [fe13][10.6.145.206:9301] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:370) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:495) ~[elasticsearch-5.2.0.jar:5.2.0]
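A quick way to rule out plain network trouble is to probe the transport port named in the warning by hand (a minimal sketch, using the host and port from the log line above):

import socket

# If this connects, the node's transport port is reachable and the
# timeouts are not simple network failures.
sock = socket.create_connection(("10.6.145.206", 9301), timeout=5)
print("transport port is reachable")
sock.close()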
The master log, meanwhile, contains entries like this:
[2017-02-01T22:35:12,278][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [esm00] [bucket_huge][2]: failed to list shard for shard_store on node [U42xk5bIRl-7vv96dM0Yug]
[...]
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/elasticsearch/store.9301/data/store/nodes/0/indices/TDOQKPmFT-iz3rB1vvODBw/2/index/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:127) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
(this is a local filesystem and only this node uses it)
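To dig into why the shards stay unassigned, I also query the allocation explain API (a minimal sketch; the endpoint is again a placeholder, and the reroute call is only meant for after the underlying cause is fixed):

import json
import requests

ES = "http://localhost:9200"  # placeholder HTTP endpoint

# With an empty body, _cluster/allocation/explain picks the first
# unassigned shard it finds and explains why it cannot be allocated.
explain = requests.get(ES + "/_cluster/allocation/explain", timeout=30).json()
print(json.dumps(explain, indent=2))

# Once the cause is fixed, ask the master to retry allocations that
# already hit their retry limit.
requests.post(ES + "/_cluster/reroute?retry_failed=true", timeout=30)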
Switching back to 5.1.2 solves these problems.