Anybody else having problems with 5.2.0?

Hi,

I've tried to upgrade a 5.1.2 cluster to 5.2.0, but it never became green: some shards remained unassigned, while others got stuck in initializing. I waited for quite a while, but recovery is a lot faster on 5.1.2. Also, once this state is reached, the machines do nothing (compared to 5.1.2, where an initializing node eats CPU and disk IO), and I can't even query _cat/shards; it doesn't return (_cluster/health does).
It doesn't reach a stable state even when I wait four times longer than 5.1.2 needs to become green after a full restart.
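
For reference, these are roughly the checks I run while waiting (assuming the REST API is reachable on localhost:9200; adjust host and port to your setup):

    # Overall cluster state and the number of unassigned/initializing shards:
    curl -s 'localhost:9200/_cluster/health?pretty'
    # Per-shard state; the unassigned.reason column hints at why a shard is not allocated:
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node'
    # Since 5.0 the allocation explain API describes one unassigned shard in detail:
    curl -s 'localhost:9200/_cluster/allocation/explain?pretty'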

The log is full of timeouts like this:

[2017-02-01T22:53:29,248][WARN ][o.e.c.NodeConnectionsService] [fe01] failed to connect to node {fe13}{UoA6Otl6RMKcdnmSZqw7ew}{1V5CtapBRAKEk3i2f4ZbBQ}{10.6.145.206}{10.6.145.206:9301} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [fe13][10.6.145.206:9301] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:370) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:495) ~[elasticsearch-5.2.0.jar:5.2.0]

while the master log contains entries like this:

[2017-02-01T22:35:12,278][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [esm00] [bucket_huge][2]: failed to list shard for shard_store on node [U42xk5bIRl-7vv96dM0Yug]
[...]
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/elasticsearch/store.9301/data/store/nodes/0/indices/TDOQKPmFT-iz3rB1vvODBw/2/index/write.lock
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:127) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]

(this is a local filesystem and only this node uses it)
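
The exception says the lock is held by this virtual machine, so it should not be another process; still, something like this would confirm that nothing else has the lock file open and that only one node is running on the box:

    # Which process, if any, holds the shard's write.lock open:
    lsof /data/elasticsearch/store.9301/data/store/nodes/0/indices/TDOQKPmFT-iz3rB1vvODBw/2/index/write.lock
    # Make sure there is only one elasticsearch process on this machine:
    ps aux | grep '[e]lasticsearch'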
Switching back to 5.1.2 solves these problems.
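
One thing that might be worth ruling out is genuinely slow connections: the connect_timeout[30s] above matches the default transport.tcp.connect_timeout, so temporarily raising it would at least show whether the nodes ever manage to connect. A rough sketch only, assuming a package install with the config under /etc/elasticsearch (the setting is static, so every node needs a restart):

    # Diagnostic only: raise the transport connect timeout from its 30s default.
    # Path assumes a package install; adjust for your layout.
    echo 'transport.tcp.connect_timeout: 120s' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
    sudo systemctl restart elasticsearch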

It looks like something went quite wrong during the upgrade. We have upgrade tests as part of the build, so I don't think it is a general problem with upgrades. I'm not super familiar with how the Lucene index lock works, so I'm not sure what is up here.

I've tried the upgrade three times from the same snapshot, and each time I got similar errors.
The cluster runs nicely on 5.1.2: all recoveries run to completion and the cluster becomes green.
On 5.1.2 I can query _cat/shards at any time, while on 5.2.0 it just freezes after a while.
If something went wrong, I suspect it is in 5.2.0 itself, which is why I'm asking whether others see similar issues.
Maybe it's not related to the upgrade process itself but to another component, such as Lucene or Netty.
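
For what it's worth, a quick way to watch the recoveries and to reproduce the hang without blocking the shell forever is something like this (again assuming the REST API on localhost:9200):

    # Ongoing recoveries only; on 5.1.2 these progress steadily:
    curl -s 'localhost:9200/_cat/recovery?v&active_only=true'
    # Client-side timeout so the request gives up instead of hanging forever:
    curl -s --max-time 30 'localhost:9200/_cat/shards?v'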

Hi,
I am facing an issue with 5.2.0 too. The nodes are losing their connections:

    ... 58 more

[2017-02-08T08:09:26,610][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [xxx] failed to execute on node [KX1THlutQXKbF4lRKlkhEw]
org.elasticsearch.transport.NodeNotConnectedException: [xxx][xxx.xxx.xxxx.xxxx:9300] Node not connected

How did you switch back? I need to do the same!

Thank you!

Alex

I did the upgrade during a service window, with no modifications to the cluster in the meantime, so I could switch back to 5.1.2 with a simple filesystem snapshot rollback.
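
In case it helps, the procedure was essentially the following. This is only a rough sketch: the service, dataset and snapshot names are examples, and the snapshot rollback command depends entirely on your filesystem or volume manager.

    # Rough sketch; names are examples, not the exact commands from my cluster.
    sudo systemctl stop elasticsearch                  # stop the 5.2.0 node on every host
    sudo zfs rollback data/elasticsearch@pre-upgrade   # roll the data path back to the pre-upgrade snapshot
    sudo apt-get install elasticsearch=5.1.2           # downgrade the package (flags depend on your package manager)
    sudo systemctl start elasticsearch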
