Anybody else having problems with 5.2.0?

Hi,

I've tried to upgrade a 5.1.2 cluster to 5.2.0, but it never became green: some shards remained unassigned, while others got stuck in initializing. I waited for quite a while, but recovery is a lot faster on 5.1.2. Also, once this state is reached, the machines do nothing (compared to 5.1.2, where an initializing node eats CPU and disk IO), and I can't even query _cat/shards; it doesn't return (_cluster/health does).
It doesn't reach a stable state even when I wait four times longer than 5.1.2 needs to become green after a full restart.
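
For reference, these are roughly the checks I run while waiting (assuming the REST API is reachable on localhost:9200; adjust host and port to your setup):

    # Overall cluster state and the number of unassigned/initializing shards:
    curl -s 'localhost:9200/_cluster/health?pretty'
    # Per-shard state; the unassigned.reason column hints at why a shard is not allocated:
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node'
    # Since 5.0 the allocation explain API describes one unassigned shard in detail:
    curl -s 'localhost:9200/_cluster/allocation/explain?pretty'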

The log is full of timeouts like this:

[2017-02-01T22:53:29,248][WARN ][o.e.c.NodeConnectionsService] [fe01] failed to connect to node {fe13}{UoA6Otl6RMKcdnmSZqw7ew}{1V5CtapBRAKEk3i2f4ZbBQ}{10.6.145.206}{10.6.145.206:9301} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [fe13][10.6.145.206:9301] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:370) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:495) ~[elasticsearch-5.2.0.jar:5.2.0]

while the master log contains entries like this:

[2017-02-01T22:35:12,278][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [esm00] [bucket_huge][2]: failed to list shard for shard_store on node [U42xk5bIRl-7vv96dM0Yug]
[...]
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/elasticsearch/store.9301/data/store/nodes/0/indices/TDOQKPmFT-iz3rB1vvODBw/2/index/write.lock
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:127) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]

(this is a local filesystem and only this node uses it)
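
The exception says the lock is held by this virtual machine, so it should not be another process; still, something like this would confirm that nothing else has the lock file open and that only one node is running on the box:

    # Which process, if any, holds the shard's write.lock open:
    lsof /data/elasticsearch/store.9301/data/store/nodes/0/indices/TDOQKPmFT-iz3rB1vvODBw/2/index/write.lock
    # Make sure there is only one elasticsearch process on this machine:
    ps aux | grep '[e]lasticsearch'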
Switching back to 5.1.2 solves these problems.
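
One thing that might be worth ruling out is genuinely slow connections: the connect_timeout[30s] above matches the default transport.tcp.connect_timeout, so temporarily raising it would at least show whether the nodes ever manage to connect. A rough sketch only, assuming a package install with the config under /etc/elasticsearch (the setting is static, so every node needs a restart):

    # Diagnostic only: raise the transport connect timeout from its 30s default.
    # Path assumes a package install; adjust for your layout.
    echo 'transport.tcp.connect_timeout: 120s' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
    sudo systemctl restart elasticsearch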

It looks like something went quite wrong during the upgrade. We have upgrade tests as part of the build, so I don't think it is a general problem with upgrades. I'm not super familiar with how the Lucene index lock works, so I'm not sure what is up here.

I've tried the upgrade three times from the same snapshot, and each time I got similar errors.
The cluster runs nicely on 5.1.2: all recoveries run to completion and the cluster becomes green.
On 5.1.2 I can query _cat/shards at any time, while on 5.2.0 it just freezes after a while.
If something went wrong, I suspect it is in 5.2.0 itself, which is why I'm asking whether others see similar issues.
Maybe it's not related to the upgrade process itself but to another component, such as Lucene or Netty.
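
For what it's worth, a quick way to watch the recoveries and to reproduce the hang without blocking the shell forever is something like this (again assuming the REST API on localhost:9200):

    # Ongoing recoveries only; on 5.1.2 these progress steadily:
    curl -s 'localhost:9200/_cat/recovery?v&active_only=true'
    # Client-side timeout so the request gives up instead of hanging forever:
    curl -s --max-time 30 'localhost:9200/_cat/shards?v'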

Hi,
I am facing an issue with 5.2.0 too. The nodes are losing their connections:

    ... 58 more

[2017-02-08T08:09:26,610][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [xxx] failed to execute on node [KX1THlutQXKbF4lRKlkhEw]
org.elasticsearch.transport.NodeNotConnectedException: [xxx][xxx.xxx.xxxx.xxxx:9300] Node not connected

How did you switch back? I need to do the same!

Thank you!

Alex

I did the upgrade during a service window, with no modifications to the cluster in the meantime, so I could switch back to 5.1.2 with a simple filesystem snapshot rollback.
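
In case it helps, the procedure was essentially the following. This is only a rough sketch: the service, dataset and snapshot names are examples, and the snapshot rollback command depends entirely on your filesystem or volume manager.

    # Rough sketch; names are examples, not the exact commands from my cluster.
    sudo systemctl stop elasticsearch                  # stop the 5.2.0 node on every host
    sudo zfs rollback data/elasticsearch@pre-upgrade   # roll the data path back to the pre-upgrade snapshot
    sudo apt-get install elasticsearch=5.1.2           # downgrade the package (flags depend on your package manager)
    sudo systemctl start elasticsearch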
