Issue with Coordinator node down

We are using Elasticsearch 5.x in production. We have a Java-based application that communicates with Elasticsearch using the TCP transport API, and we pass the host and port of 4 coordinator nodes when establishing the connection.

What we observed is that when the 1st node in the list (a single node) went down, none of the clients were able to connect, and the entire site that depends on Elasticsearch was down.

Any thoughts? Feel free to ask for more details if need be. We would appreciate an urgent response from anyone who can provide any insights.

Thank you
Keyur

Which version, exactly, are you using?

Are you using sniffing (client.transport.sniff)?

Are you keeping the Elasticsearch Client object alive for an extended period of time, or are you creating a new one each time you need to interact with Elasticsearch?

Are there any interesting log messages? Can you reproduce the problem with DEBUG-level logging within org.elasticsearch.client.transport and provide logs?

Version - 5.1.2

We are not using sniffing.

Client is alive for extended period of time and we are reusing for all the queries.

Below is the code snippet creating the client.

Settings settings = Settings.builder()
        .put(EL_CLUTER_NAME, clusterName)
        .build();
TransportClient client = new PreBuiltTransportClient(settings);
for (String coordinatedNode : StringUtils.split(coordinatedNodes, COMMA)) {
    String hostName = substringBeforeLast(coordinatedNode, COLON);
    String port = substringAfterLast(coordinatedNode, COLON);
    client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(hostName), Integer.valueOf(port)));
}
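For anyone reading along, the address-parsing step of the snippet above can be sketched with the JDK alone. This is only an illustration under assumptions: the class name CoordinatorAddresses, the method parse, and the host names are hypothetical and not part of the original application. It splits a comma-separated "host:port" list and rejects malformed entries up front, so a bad entry fails loudly at startup instead of silently shrinking the node list.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of the poster's application or of Elasticsearch.
final class CoordinatorAddresses {

    /** Parses "host1:9300,host2:9300" into unresolved socket addresses (no DNS lookup here). */
    static List<InetSocketAddress> parse(String csv) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : csv.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas and blank entries
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got '" + trimmed + "'");
            }
            String host = trimmed.substring(0, colon);
            // NumberFormatException is a subclass of IllegalArgumentException
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    public static void main(String[] args) {
        // Host names below are placeholders.
        for (InetSocketAddress address : parse("es-coord-1:9300, es-coord-2:9300")) {
            System.out.println(address.getHostString() + " port " + address.getPort());
        }
    }
}
```

Each parsed address would then be resolved and handed to client.addTransportAddress(...) as in the original loop. Logging the parsed list at startup is a cheap way to confirm that all 4 coordinator nodes actually reached the client.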

Our understanding of sniffing is that, once it is enabled, the client replaces the coordinator nodes provided at creation with the data nodes it finds using the internal cluster state API. Please let us know if this is wrong.

Also, what is the use of calling the following on the client?

client.connectedNodes(); // We are not doing this, but found it in other implementations

We will enable DEBUG logs and provide any other log messages.

Thanks. That rules out a few of the more obvious things that might be happening here.

Your understanding of sniffing is right, yes. I wasn't recommending that you enable it, by the way; I was just asking because the way the transport client connects to nodes differs depending on whether sniffing is enabled.

client.connectedNodes() returns the list of connected nodes, but has no other effects.

Looking through the logs is the next step to take. It's worth pointing out that the end-of-life date for 5.1.2 is in less than three weeks, and newer versions have seen changes that might relate to this issue. Upgrading is recommended.

We will try to get logs for the event timeframe, but we still wanted to check: if we are not using sniffing and we have 4 coordinator nodes, why would just 1 node failing compromise the querying capability of all the clients?

Any thoughts in the interim?

That's what we're trying to work out.

A couple of extra questions:

  • what is the network connection between the client and the coordinating nodes? Is there anything special like firewalls, packet filters, load balancers?
  • when you say "1st node in the LIST (one node) - went down", what exactly do you mean? Did the underlying machine get shut down? Was Elasticsearch shut down gracefully, or did it crash? Did the Elasticsearch process completely stop, or did it hang?
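One way to start answering the questions above from the client host is a plain TCP probe of each coordinating node's transport port. The sketch below uses only the JDK. The class name TransportPortProbe and its Result values are hypothetical, and the mapping from exception type to cause is a rough heuristic rather than a definitive diagnosis. The underlying idea: a stopped process on a running machine usually refuses the connection immediately, whereas a powered-off machine or a firewall that drops packets usually shows up as a timeout.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical diagnostic helper: not part of Elasticsearch.
final class TransportPortProbe {

    enum Result { OPEN, REFUSED, TIMED_OUT, OTHER_ERROR }

    /** Attempts a TCP connect and classifies the outcome. */
    static Result probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.OPEN;        // something accepted the TCP handshake
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;   // no answer within timeoutMillis: host down or packets dropped
        } catch (ConnectException e) {
            return Result.REFUSED;     // typically "connection refused": host up, nothing listening
        } catch (IOException e) {
            return Result.OTHER_ERROR; // e.g. unknown host or no route to host
        }
    }

    public static void main(String[] args) {
        // Usage: java TransportPortProbe host1:9300 host2:9300 ...
        for (String arg : args) {
            int colon = arg.lastIndexOf(':');
            String host = arg.substring(0, colon);
            int port = Integer.parseInt(arg.substring(colon + 1));
            System.out.println(arg + " -> " + probe(host, port, 3000));
        }
    }
}
```

Note that OPEN only shows that the operating system completed the TCP handshake. A hung Elasticsearch process can still appear OPEN, so this complements the DEBUG logs rather than replacing them.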
