Hello everyone.
I'm using Elastic's Elasticsearch 5.3.0 container. I have 2 clusters and 2 tribe nodes.
My Kibana connects to the tribe nodes, just for searches.
Recently I've noticed that, when I call the /_cluster/health
API, I can't see the tribe nodes in the node count.
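For reference, I'm calling the API roughly like this (the hostname here is a placeholder for one of my master nodes):

```shell
# Hostname/port are placeholders; I run this against a master node of one cluster
curl -s 'http://elasticsearch-master:9200/_cluster/health?pretty'
```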
The json below is what I get when calling the API above inside the master node of one of my clusters.
{
  "cluster_name" : "app-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 32,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 17143,
  "active_shards" : 19187,
  ...
}
As you can see, this cluster has 30 data nodes and 2 master nodes, 32 in total. As I said, I have 2 tribe nodes, so shouldn't I see 34 nodes in total?
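To spell out the count I expected (numbers taken from the health output above):

```shell
# 30 data nodes + 2 masters = 32 reported; plus my 2 tribe nodes = 34 expected
reported=$((30 + 2))
expected=$((reported + 2))
echo "$reported $expected"
```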
Searching the logs, I find this:
[2017-07-07T20:20:31,789][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}]
[2017-07-07T20:20:33,886][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-07-07T20:20:34,910][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-master-app-atlas-002] [gc][1827810] overhead, spent [262ms] collecting in the last [1s]
[2017-07-07T20:20:34,985][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}]
[2017-07-07T20:20:35,069][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-app-atlas-002] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [BHZ80VWbQlC_zmNm48PEjw]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.3.0.jar:5.3.0]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [5.3.0-tribe-002/app][XX.XXX.XX.XXX:9301][cluster:monitor/nodes/stats[n]]
Caused by: org.elasticsearch.ElasticsearchSecurityException: missing authentication token for action [cluster:monitor/nodes/stats[n]]
at org.elasticsearch.xpack.security.support.Exceptions.authenticationError(Exceptions.java:39) ~[?:?]
...
[2017-07-07T20:20:37,303][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
I have disabled X-Pack on all nodes through docker-compose, as shown below:
version: '2'
services:
  elasticsearch:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: elasticsearch
    environment:
      - action.destructive_requires_name=true
      - bootstrap.memory_lock=true
      - cluster.name=app-atlas
      - cluster.routing.allocation.awareness.attributes=rack_id
      - cluster.routing.allocation.node_initial_primaries_recoveries=40
      - cluster.routing.allocation.node_concurrent_recoveries=40
      - discovery.zen.minimum_master_nodes=1
      - discovery.zen.master_election.ignore_non_master_pings=true
      - discovery.zen.ping.unicast.hosts=XX.XXX.XX.XXX,YY.YYY.YY.YY
      - http.port=9200
      - http.cors.enabled=true
      - indices.recovery.max_bytes_per_sec=400mb
      - indices.fielddata.cache.size=20%
      - indices.store.throttle.type=none
      - node.name=elasticsearch-master-app-atlas-002
      - node.master=true
      - node.data=false
      - node.attr.rack_id=rack_d
      - thread_pool.bulk.queue_size=400
      - thread_pool.bulk.size=40
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.watcher.enabled=false
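To double-check that those flags actually took effect on the running nodes, I look at the node settings with the nodes info API (host is a placeholder again):

```shell
# "settings" is a metric of the nodes info API; host/port are placeholders
curl -s 'http://elasticsearch-master:9200/_nodes/settings?pretty' | grep xpack
```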
This is my tribe node's configuration:
version: '2'
services:
  tribe:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: tribe
    environment:
      - node.name=5.3.0-tribe-001
      - cluster.name=tribe-atlas
      - node.master=false
      - node.data=false
      - transport.tcp.port=9300
      - http.port=9200
      - tribe.infra.cluster.name=infra-atlas
      - tribe.infra.discovery.zen.ping.unicast.hosts=CC.CCC.CC.CC,DD.DDD.DD.DDD
      - tribe.app.cluster.name=app-atlas
      - tribe.app.discovery.zen.ping.unicast.hosts=AA.AAA.AA.AA,BB.BBB.BB.BBB
      - xpack.watcher.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.security.enabled=false
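To see which nodes the tribe node actually knows about, I use the cat nodes API (host is a placeholder for one of my tribe nodes):

```shell
# Lists every node the tribe node currently sees, with role and master columns
curl -s 'http://tribe:9200/_cat/nodes?v'
```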
When I call the /_cluster/health
API on a tribe node, I get the sum of the nodes of both clusters at any given time, as if the tribe node were connected to the clusters, as shown below:
{
  "cluster_name" : "tribe-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 45,
  "number_of_data_nodes" : 38,
  "active_primary_shards" : 18481,
  "active_shards" : 21484,
  ...
}
As you can see, it shows a total of 45 nodes, as if only one tribe node were connected.
Making things worse, my Kibana, which connects only to the tribe nodes, every now and then shows the message "Courier Fetch: X of Y shards failed."
When I get that warning, I get this from Elasticsearch:
...
node: "PzeetkqoS9O32mkCTXUkiw",
reason: {type: "task_cancelled_exception", reason: "cancelled"},
reason: "cancelled",
type: "task_cancelled_exception",
shard: 1,
successful: 3
...
Are those problems related?
Every container I use is from Elastic; I just added the installation of the S3 repository plugin and uploaded the image to my private registry.
Does anyone have a clue about what's going on?
Unfortunately, I can't upgrade to Kibana 5.5 and use cross-cluster search.