What protocol does Elasticsearch use for unicast Zen discovery?


(Carmelo) #1

In the official docs they talk about a ping.

Are we talking about ICMP at the network layer?
Is it maybe TCP?
What else?


(Mark Walkom) #2

The ping is TCP.


(Carmelo) #3

Ping is ICMP, a protocol at the network layer; it's not TCP.
TCP is a protocol at the transport layer.

So are you saying that Elasticsearch uses a 'ping check' at the transport layer?
Correct?

Thanks


(Mark Walkom) #4

In the context of ES pinging a node for discovery, that "ping" is done using TCP.

I realise a literal ping is done via ICMP though.
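To make that distinction concrete, here is a minimal sketch (assuming Python is available; the hostname is a placeholder from this thread, and 9300 is the default Elasticsearch transport port) of checking what the discovery "ping" actually depends on: a plain TCP connection, not an ICMP echo.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    This exercises only the TCP handshake -- the layer Zen discovery's
    "ping" runs over -- and sends no payload to the transport port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "hostname383" is a placeholder host from this thread;
    # 9300 is the default Elasticsearch transport port.
    print(can_connect("hostname383", 9300))
```

A host can answer ICMP ping while still refusing TCP on 9300 (firewall, node down), so this check is closer to what Elasticsearch itself sees.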


(Carmelo) #5

It makes sense, thanks.
I'm still trying to figure out how to solve my 'failed to ping' issue.

I also noticed that the cluster got stuck after running for a few hours, and I see these errors in the logs:

failed to list shard

[2015-05-27 05:05:29,015][WARN ][gateway.local            ] [hostname148] [config][2]: failed to list shard state on node [TsCyLTuVT32aVt4TQH9HNg]
org.elasticsearch.action.FailedNodeException: Failed node [TsCyLTuVT32aVt4TQH9HNg]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$3.run(TransportService.java:288)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [hostname383][inet[/xx.xx.xx.34:9300]][internal:gateway/local/started_shards[n]]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:284)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
        at org.elasticsearch.gateway.local.state.shards.TransportNodesListGatewayStartedShards.list(TransportNodesListGatewayStartedShards.java:66)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStates(LocalGatewayAllocator.java:407)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:127)
        at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:219)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:162)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:148)
        at org.elasticsearch.discovery.zen.ZenDiscovery$6.execute(ZenDiscovery.java:563)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:365)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
        ... 3 more

NodeNotConnectedException

Caused by: org.elasticsearch.transport.NodeNotConnectedException: [hostname383][inet[/xx.xx.xx.34:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:270)
        ... 20 more
[2015-05-27 05:05:29,017][WARN ][gateway.local            ] [hostname148] [checks][2]: failed to list shard state on node [TsCyLTuVT32aVt4TQH9HNg]
org.elasticsearch.action.FailedNodeException: Failed node [TsCyLTuVT32aVt4TQH9HNg]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$3.run(TransportService.java:288)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [hostname383][inet[/xx.xx.xx.34:9300]][internal:gateway/local/started_shards[n]]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:284)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
        at org.elasticsearch.gateway.local.state.shards.TransportNodesListGatewayStartedShards.list(TransportNodesListGatewayStartedShards.java:66)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStates(LocalGatewayAllocator.java:407)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:127)
        at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:219)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:162)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:148)
        at org.elasticsearch.discovery.zen.ZenDiscovery$6.execute(ZenDiscovery.java:563)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:365)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
        ... 3 more
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [hostname383][inet[/xx.xx.xx.34:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:270)
        ... 20 more

transport disconnected

[2015-05-27 05:05:29,019][INFO ][cluster.service          ] [hostname148] removed {[hostname036][1bSE8cf5Sw6FlXbyqR3ocQ][hostname036][inet[/xx.xx.xx.137:9300]]{data=false, master=false},}, reason: zen-disco-node_failed([hostname036][1bSE8cf5Sw6FlXbyqR3ocQ][hostname036][inet[/xx.xx.xx.137:9300]]{data=false, master=false}), reason transport disconnected
[2015-05-27 05:05:29,053][INFO ][cluster.service          ] [hostname148] removed {[hostname383][TsCyLTuVT32aVt4TQH9HNg][hostname383][inet[/xx.xx.xx.34:9300]],}, reason: zen-disco-node_failed([hostname383][TsCyLTuVT32aVt4TQH9HNg][hostname383][inet[/xx.xx.xx.34:9300]]), reason transport disconnected

I found something similar here: ES 1.4.2 random node disconnect · Issue #9212 - https://github.com/elastic/elasticsearch/issues/9212
but I haven't overloaded my cluster, and it's running Elasticsearch 1.5, not 1.4.2.

Do you (or anyone else) have any advice that could help me solve the issue?
Any thoughts?

Thanks as always


(Mark Walkom) #6

Can you show us your config (minus comments/empty lines please)?

Are the hosts in the same DC? Are you measuring network latency, or monitoring your cluster and its nodes using Marvel or similar?


(Carmelo) #7

Sure,
I have 5 master/data nodes + 2 client nodes (just ES load balancers).
This is my config:

cluster.name: elasticsearch_dev
node.name: "<FQDN>"
path.data: /esdata1, /esdata2

discovery.zen.minimum_master_nodes: 1

gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 5
discovery.zen.ping.multicast.enabled: false

## fqdn everywhere 
discovery.zen.ping.unicast.hosts: ["hostname272:9300", "hostname383:9300", "hostname587:9300", "hostname148:9300", "hostname038:9300","hostname036:9300","hostname108:9300"]

index.number_of_replicas: 2

discovery.zen.ping.timeout: 30s
bootstrap.mlockall: true


## added later but same behavior 
# Add fault detection
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
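One thing that stands out in the config above (an observation, not necessarily the cause of the disconnects): with 5 master-eligible nodes, `discovery.zen.minimum_master_nodes: 1` leaves the cluster open to split-brain. The guidance for ES 1.x is (master_eligible_nodes / 2) + 1, which for 5 master-eligible nodes would be:

```yaml
# quorum = floor(5 / 2) + 1 = 3 for 5 master-eligible nodes
discovery.zen.minimum_master_nodes: 3
```

This won't fix a network-level disconnect, but it keeps a partitioned minority from electing its own master while you debug one.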

_nodes
http://pastebin.com/iwZ1Z6na

_nodes/stats
http://pastebin.com/NuQ2Gxjk

The hosts are in the same DC. They are virtual machines (probably on the same physical server, but I'm not sure of it).
I installed Marvel, but it produced additional errors in the logs, so I uninstalled it.

At the moment I'm checking the network, running continuously from each master/data node:

  1. an fping to the other hosts every 3 seconds (ICMP ping), and all the hosts are always alive
Wed May 27 16:12:38 AEST 2015 -- Sleep --
hostname272 is alive
hostname383 is alive
hostname587 is alive
hostname148 is alive
hostname038 is alive
hostname036 is alive
hostname108 is alive
Wed May 27 16:12:41 AEST 2015 -- Sleep --
hostname272 is alive
hostname383 is alive
hostname587 is alive
hostname148 is alive
hostname038 is alive
hostname036 is alive
hostname108 is alive
  2. an hping to the other hosts on port 9300 (one TCP SYN packet every 2 seconds)

but this check is causing another exception on the ES cluster, so I will stop doing it:

StreamCorruptedException: invalid internal transport message format, got (47,45,54,20)
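As an aside on that StreamCorruptedException: the byte values in the message are hexadecimal, and they decode to the ASCII string "GET " — the start of an HTTP request. A quick sketch of the decoding:

```python
# The bytes reported by the exception, "(47,45,54,20)", are hex values.
frame_start = bytes.fromhex("47455420")
print(frame_start.decode("ascii"))  # -> 'GET '
```

A bare TCP SYN from hping carries no payload, so it can trip connect/disconnect noise but not this particular error; "GET " suggests something (a curl, browser, or health check) spoke HTTP to the transport port 9300 instead of the HTTP port 9200.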

If there is anything else I could check, please let me know.

Thanks


(system) #8