Elasticsearch Zen discovery

We are having an issue with an issue with Ubuntu kernel causing the machine running elasticsearch to hung.
While the machine in this state (it only solve after restart) no API call to elasticsearch (or even SSH to the server) is returning result, though elasticsearch continues to identify the server as part of the cluster (because the way Zen discovery works). It means that the server not responding while the master does not remove it from the cluster.

Any ideas?

This sounds about right.

@dudidu
How many nodes in this cluster?
Can you share the elasticsearch.yml
It would be interesting to see what the discovery.zen.minimum_master_nodes is set to.

Total of 78 elasticsearch nodes:

6 clients nodes
3 master nodes
39 data nodes (hot)
30 data nodes (cold)

elasticsearch.yml

cluster.name: tier-1

node.name: prod-elasticsearch-data-038
node.master: false
node.data: true

node.attr.box_type: L8

http.cors.enabled: true

http.cors.allow-origin: "*"

path.data: /mnt
path.logs: /var/log/elasticsearch

bootstrap.memory_lock: true

network.bind_host: ["_site_", "_local_"]

network.publish_host: _eth0:ipv4_

discovery.zen.ping.unicast.hosts: ["prod-elasticsearch-master-001", "prod-elasticsearch-master-002", "prod-elasticsearch-master-003"]
discovery.zen.minimum_master_nodes: 2

action.destructive_requires_name: false

thread_pool.bulk.queue_size: 3000

discovery.zen.minimum_master_nodes: 2

What version are you on?
What kernel?
What JVM?

That's likely to cause more problems than it is worth.

Elasticsearch version: Version: 6.2.4

JVM version: openjdk 1.8.0_171

OS version: Linux prod-elasticsearch-data-010 4.13.0-1011-azure #14-Ubuntu SMP Thu Feb 15 16:15:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

This sounds like it's a known kernel bug: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1772264

1 Like

There is no simple solution to this. Nodes that are completely working or completely broken are easy to identify, but in this case AIUI the faulty node is neither: it just repeatedly hangs for a while and then starts working again. Each time this happens it looks to the master node like a transient network issue. There is no "correct" way to deal with this to this, only heuristics, and Elasticsearch's heuristics are quite conservative in their detection of faulty nodes because removing a node and reallocating all its shards can be quite expensive.

1 Like

Is there any workaround we can use?

It seems that there is a known issue with ubuntu kernel causing the machine to hung, which currently does not have a fix (A kernel fix should be released in the next couple of weeks).

Can you please elaborate on why and to what it should change?
Without this value we will get rejected.

Rolling back to a non-buggy kernel version is all I can suggest. I've had a quick look, but can't find details on when this bug was introduced, sorry. The discussion about this kernel issue all seems to be in the last month or so.

These people rolled back to 4.11 which fixed it for them, if that helps.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.