We have an Elasticsearch cluster running version 2.4.6 at our company, and we recently hit a strange problem. After adding nodes to the cluster, the client sees a large number of requests fail with status 499 or 502 because they time out, even though the cluster has finished rebalancing. After we remove the new nodes, the client recovers in less than two minutes.
The old machines run CentOS 6.10 and the new machines run CentOS 7.1. There are no other differences between the old nodes and the new nodes.
There are no error logs in the cluster.
There are no hot nodes while the client's requests are timing out. I have checked the shard distribution; shard balance and index balance are both OK.
Has anyone seen a case like this? Can somebody help, or give me some ideas on how to troubleshoot this problem?
Thanks.
But upgrading is not a simple thing in a production environment, especially from 2.x. If I upgrade the cluster, the applications also need to upgrade their client version (we have too many applications using the client), so we have to stay on this version for at least half a year.
Do you have any good advice for resolving my issue, or any advice on how to find the cause?
I have no idea. For the server side, I'd use the same OS, same JVM, etc. for all the nodes.
On the client side, make sure that you are using the exact same version. Maybe a few things changed. I remember that in the past we were using Java to serialize some objects, and having multiple versions on the Transport Layer (which your TransportClient is using) was causing issues.
Elasticsearch fixed that a long time ago. Can't really remember when. Well, it was 3.5 years ago.
In short:
Upgrade
If you can't:
Check that you have the same up-to-date JVM on all instances and on the client side.
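One way to make the JVM check above concrete is to collect the JVM version each node reports (for example from the `jvm.version` field of the nodes info API, `GET /_nodes/jvm`), add the client's own `java.version`, and group them. The following is only a minimal sketch: the node names and version strings are made-up placeholders, and `JvmVersionCheck` and `groupByVersion` are hypothetical names, not part of any Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JvmVersionCheck {

    /** Groups node names by the JVM version string each node reports. */
    static Map<String, List<String>> groupByVersion(Map<String, String> nodeToJvm) {
        Map<String, List<String>> byVersion = new TreeMap<>();
        for (Map.Entry<String, String> entry : nodeToJvm.entrySet()) {
            byVersion.computeIfAbsent(entry.getValue(), v -> new ArrayList<>())
                     .add(entry.getKey());
        }
        return byVersion;
    }

    public static void main(String[] args) {
        // Placeholder data: in practice, fill this in from what each node reports.
        Map<String, String> nodeToJvm = new LinkedHashMap<>();
        nodeToJvm.put("old-node-1", "1.8.0_181"); // hypothetical value
        nodeToJvm.put("new-node-1", "1.8.0_131"); // hypothetical value
        // The JVM that runs the client application itself.
        nodeToJvm.put("client", System.getProperty("java.version"));

        Map<String, List<String>> byVersion = groupByVersion(nodeToJvm);
        if (byVersion.size() > 1) {
            System.out.println("JVM versions differ:");
        } else {
            System.out.println("All JVM versions match:");
        }
        for (Map.Entry<String, List<String>> entry : byVersion.entrySet()) {
            System.out.println("  " + entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

More than one group in the output means the nodes and the client are not all on the same JVM.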
BTW, I found that on the client side, our system creates more than one Java TransportClient instance.
In another discussion (TransportClient thread safe), you mentioned that the whole JVM needs only one client instance. Why is it not thread safe if there are multiple instances of the Java client, and what will happen if we create multiple clients? I haven't found any explanation in the official documentation. Could you give the answer, some error cases caused by multiple client instances, or the documentation URL?
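For reference, keeping a single client per JVM could look roughly like the sketch below. It does not answer what goes wrong with multiple instances; it only shows one way to ensure that every caller in the application shares the same object. `SharedClient` is a hypothetical helper, not an Elasticsearch class, and the `Object` built in `main` is a stand-in for wherever the application constructs its real TransportClient.

```java
import java.util.function.Supplier;

final class SharedClient<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the single shared instance, creating it on first use. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    // The factory runs at most once, even with concurrent callers.
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Stand-in factory: a real application would build its TransportClient here.
        SharedClient<Object> holder = new SharedClient<>(Object::new);
        Object first = holder.get();
        Object second = holder.get();
        System.out.println("same instance: " + (first == second));
    }
}
```

The application would keep one such holder (for example in a static field or its dependency injection container) and close the underlying client once at shutdown.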