We are running Elasticsearch 5.x in production. We have a Java-based application that communicates with Elasticsearch using the TCP transport API, and we pass the host and port of 4 coordinating nodes when establishing the connection.
However, we observed that when the first node in the list (a single node) went down, none of the clients were able to connect, and the entire site that depends on Elasticsearch was down.
Any thoughts? Feel free to ask for more details if needed. We would appreciate an urgent response from anyone who can provide any insights.
Are you keeping the Elasticsearch Client object alive for an extended period of time, or are you creating a new one each time you need to interact with Elasticsearch?
Are there any interesting log messages? Can you reproduce the problem with DEBUG-level logging within org.elasticsearch.client.transport and provide logs?
Our understanding of sniffing is that, once we enable it, the client will replace the coordinating nodes provided at creation with the data nodes it finds using the internal cluster state API. Please let us know if this is wrong.
Also, what is the use of calling the following on the client?
client.connectedNodes(); //We are not doing this but found in other implementations
Will enable DEBUG logs and provide any other log messages.
Thanks. That rules out a few of the more obvious things that might be happening here.
That's it, yes. I wasn't recommending you do this, by the way. I was just asking because the way that the transport client connects to nodes differs depending on whether sniffing is enabled or not.
This returns the list of connected nodes, but has no other effects.
Looking through the logs is the next step to take. It's worth pointing out that the end-of-life date for 5.1.2 is in less than three weeks, and newer versions have seen changes that might relate to this issue. Upgrading is recommended.
We will try to get logs for the event timeframe, but we still wanted to check: if we are not using sniffing and we have 4 coordinating nodes, why would just 1 node failing compromise the querying capabilities of all the clients?
What is the network connection between the client and the coordinating nodes? Is there anything special like firewalls, packet filters, load balancers?
When you say "1st node in the LIST (one node) - went down", what exactly do you mean? Did the underlying machine get shut down? Was Elasticsearch shut down gracefully, or did it crash? Did the Elasticsearch process completely stop, or did it hang?
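In case it helps with the network question above, here is a minimal sketch of a plain TCP connect probe that could be run from the application host against each coordinating node. It uses only the JDK and none of the Elasticsearch client API. The class name NodeReachability and the host names in the usage line are made up for illustration, and 9300 is assumed only because it is the default transport port. A successful connect shows only that something accepted the TCP connection on that port. It does not show that Elasticsearch is healthy, since a hung process or a load balancer in front of the node may still accept connections.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper; not part of the Elasticsearch transport client.
class NodeReachability {

    // Returns true if a TCP connection to host:port can be opened within timeoutMillis.
    // Connection refused, timeouts and unresolvable host names all yield false.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java NodeReachability host:port [host:port ...]");
            return;
        }
        // Probe each node in the order given, mirroring the order passed to the client.
        for (String node : args) {
            int idx = node.lastIndexOf(':');
            String host = node.substring(0, idx);
            int port = Integer.parseInt(node.substring(idx + 1));
            boolean ok = isReachable(host, port, 2000);
            System.out.println(node + " -> " + (ok ? "reachable" : "unreachable"));
        }
    }
}
```

For example: java NodeReachability coord-1.example.internal:9300 coord-2.example.internal:9300 (placeholder names). Running it while the first node is down, and again while it is up, would show whether the other three nodes are reachable from the client host independently of the first one.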