Tribe node disconnecting and Kibana showing warnings

Hello everyone.

I'm using Elastic's Elasticsearch 5.3.0 container. I have 2 clusters and 2 tribe nodes.
My Kibana connects to the tribe nodes, just for searches.

Recently I've noticed that, when I call the /_cluster/health API, I can't seem to find the tribe nodes.
The JSON below is what I get when calling that API on the master node of one of my clusters.
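For reference, the call is just the standard health endpoint (host and port here are placeholders for the master node):

# Cluster health as seen from one of the app-atlas master nodes.
curl -s 'http://localhost:9200/_cluster/health?pretty'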

{
  "cluster_name" : "app-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 32,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 17143,
  "active_shards" : 19187,
...
}

As you can see, this cluster has 30 data nodes and 2 master nodes. Since I also have 2 tribe nodes, shouldn't I see 34 nodes in total?
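To see exactly which nodes this master knows about at any given moment, _cat/nodes should list them, including the tribe nodes while they are joined (host and port are placeholders again):

# Every node currently in the cluster state, with its roles; a connected
# tribe node shows up with no master or data role.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,node.role,master'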

Searching the logs, I find this:

[2017-07-07T20:20:31,789][INFO ][o.e.c.s.ClusterService   ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}]
[2017-07-07T20:20:33,886][INFO ][o.e.c.s.ClusterService   ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-07-07T20:20:34,910][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-master-app-atlas-002] [gc][1827810] overhead, spent [262ms] collecting in the last [1s]
[2017-07-07T20:20:34,985][INFO ][o.e.c.s.ClusterService   ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}]
[2017-07-07T20:20:35,069][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-app-atlas-002] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [BHZ80VWbQlC_zmNm48PEjw]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.3.0.jar:5.3.0]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [5.3.0-tribe-002/app][XX.XXX.XX.XXX:9301][cluster:monitor/nodes/stats[n]]
Caused by: org.elasticsearch.ElasticsearchSecurityException: missing authentication token for action [cluster:monitor/nodes/stats[n]]
	at org.elasticsearch.xpack.security.support.Exceptions.authenticationError(Exceptions.java:39) ~[?:?]
...
[2017-07-07T20:20:37,303][INFO ][o.e.c.s.ClusterService   ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
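The missing authentication token error refers to the internal action behind the node stats API (cluster:monitor/nodes/stats), so it looks as if X-Pack security is still answering on the tribe node despite xpack.security.enabled=false. A direct call against the tribe node should show whether security is really off there (host and port are placeholders):

# cluster:monitor/nodes/stats is the transport action behind GET _nodes/stats;
# with security genuinely disabled this should not return an authentication error.
curl -s 'http://tribe-host:9200/_nodes/stats?pretty'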

I have disabled X-Pack on all nodes through docker-compose, as shown below:

version: '2'
services:
  elasticsearch:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: elasticsearch
    environment:
      - action.destructive_requires_name=true
      - bootstrap.memory_lock=true
      - cluster.name=app-atlas
      - cluster.routing.allocation.awareness.attributes=rack_id
      - cluster.routing.allocation.node_initial_primaries_recoveries=40
      - cluster.routing.allocation.node_concurrent_recoveries=40
      - discovery.zen.minimum_master_nodes=1
      - discovery.zen.master_election.ignore_non_master_pings=true
      - discovery.zen.ping.unicast.hosts=XX.XXX.XX.XXX,YY.YYY.YY.YY
      - http.port=9200
      - http.cors.enabled=true
      - indices.recovery.max_bytes_per_sec=400mb
      - indices.fielddata.cache.size=20%
      - indices.store.throttle.type=none
      - node.name=elasticsearch-master-app-atlas-002
      - node.master=true
      - node.data=false
      - node.attr.rack_id=rack_d
      - thread_pool.bulk.queue_size=400
      - thread_pool.bulk.size=40
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.watcher.enabled=false
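
To confirm those environment variables are actually picked up by the nodes, and that the X-Pack plugin is still present even though its features are disabled, the nodes info API exposes both (host and port are placeholders):

# Settings each node actually started with; the xpack.*.enabled=false values
# from docker-compose should appear here.
curl -s 'http://localhost:9200/_nodes/settings?pretty'

# Plugins loaded on each node; x-pack is still listed when it is only disabled.
curl -s 'http://localhost:9200/_nodes/plugins?pretty'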

This is my tribe node configuration:

version: '2'
services:
  tribe:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: tribe
    environment:
      - node.name=5.3.0-tribe-001
      - cluster.name=tribe-atlas
      - node.master=false
      - node.data=false
      - transport.tcp.port=9300
      - http.port=9200
      - tribe.infra.cluster.name=infra-atlas
      - tribe.infra.discovery.zen.ping.unicast.hosts=CC.CCC.CC.CC,DD.DDD.DD.DDD
      - tribe.app.cluster.name=app-atlas
      - tribe.app.discovery.zen.ping.unicast.hosts=AA.AAA.AA.AA,BB.BBB.BB.BBB
      - xpack.watcher.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.security.enabled=false
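
Note that disabling X-Pack through settings does not remove the plugin itself. To check what is actually installed in the tribe image, the standard plugin tool can list it (container name taken from the compose file above, path as in the official image):

# List the plugins baked into the image; x-pack should still show up here
# even when its features are only disabled via settings.
docker exec tribe /usr/share/elasticsearch/bin/elasticsearch-plugin list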

When I call the /_cluster/health API on a tribe node, I get the sum of all nodes from both clusters at any given time, as if the tribe node were connected to them, as shown below:

{
  "cluster_name" : "tribe-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 45,
  "number_of_data_nodes" : 38,
  "active_primary_shards" : 18481,
  "active_shards" : 21484,
...
}

As you can see, it shows a total of 45 nodes, as if only one tribe node is connected.
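To check which nodes the tribe node itself sees, and therefore which of the two tribe nodes is joined at a given moment, the same _cat/nodes call can be pointed at the tribe node (host and port are placeholders):

# From the tribe node's point of view: the nodes of both clusters should be
# listed here, and the count should line up with number_of_nodes above.
curl -s 'http://tribe-host:9200/_cat/nodes?v&h=name,ip,node.role'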

Making things worse, my Kibana, which connects only to the tribe nodes, every now and then shows the message Courier Fetch: X of Y shards failed.

When I get the warning above, this is what I get back from Elasticsearch:

...
node: "PzeetkqoS9O32mkCTXUkiw",
reason: {type: "task_cancelled_exception", reason: "cancelled"},
reason: "cancelled",
type: "task_cancelled_exception",
shard: 1,
successful: 3
...
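That snippet looks like the _shards section of a search response, where failed shards are reported under failures. The same information can be seen outside Kibana by running a search directly against the tribe node and checking that section (host, port and index name are placeholders):

# The _shards block of the response reports total/successful/failed counts,
# and "failures" carries entries like the task_cancelled_exception above.
curl -s 'http://tribe-host:9200/my-index/_search?pretty&size=0'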

Are those problems related?
Every container I use is from Elastic; I just add the S3 plugin and upload the image to my private registry.
Does anyone have a clue about what I'm going through?
Unfortunately I can't upgrade to Kibana 5.5 and use Cross Cluster Search.

Hello again.

While testing what could be done, I tried removing the X-Pack plugin from 5.3 and, apparently, it works. I mean, I no longer get the timeout errors, and no more warnings or exceptions either (remember, I'm still testing).

If no more errors pop up, I'll just run without X-Pack installed instead of installing it and disabling it.
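For anyone else hitting this, removing the plugin comes down to the standard plugin command when building the custom image (path as in the official image):

# Remove the X-Pack plugin entirely instead of only disabling its features
# through the xpack.*.enabled settings.
/usr/share/elasticsearch/bin/elasticsearch-plugin remove x-pack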
