Hello everyone.
I'm using Elastic's Elasticsearch 5.3.0 container. I have 2 clusters and 2 tribe nodes.
My Kibana connects to the tribe nodes, just for searches.
Recently I've noticed that, when I call the /_cluster/health
API, I can't see the tribe nodes in the node count.
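For reference, I'm calling the API roughly like this (the hostname here is a placeholder for one of my master nodes):

```shell
# Hostname/port are placeholders; I run this against a master node of one cluster
curl -s 'http://elasticsearch-master:9200/_cluster/health?pretty'
```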
The json below is what I get when calling the API above inside the master node of one of my clusters.
{
  "cluster_name" : "app-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 32,
  "number_of_data_nodes" : 30,
  "active_primary_shards" : 17143,
  "active_shards" : 19187,
  ...
}
As you can see, this cluster has 30 data nodes and 2 master nodes, 32 in total. As I said, I have 2 tribe nodes, so shouldn't I see 34 nodes in total?
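To spell out the count I expected (numbers taken from the health output above):

```shell
# 30 data nodes + 2 masters = 32 reported; plus my 2 tribe nodes = 34 expected
reported=$((30 + 2))
expected=$((reported + 2))
echo "$reported $expected"
```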
Searching the logs, I find this:
[2017-07-07T20:20:31,789][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}]
[2017-07-07T20:20:33,886][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-002/app}{BHZ80VWbQlC_zmNm48PEjw}{xWU7iWEBR7W8EZcv_hAiSg}{XX.XXX.XX.XXX}{XX.XXX.XX.XXX:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-07-07T20:20:34,910][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-master-app-atlas-002] [gc][1827810] overhead, spent [262ms] collecting in the last [1s]
[2017-07-07T20:20:34,985][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] added {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-join[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}]
[2017-07-07T20:20:35,069][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-app-atlas-002] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [BHZ80VWbQlC_zmNm48PEjw]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.3.0.jar:5.3.0]
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [5.3.0-tribe-002/app][XX.XXX.XX.XXX:9301][cluster:monitor/nodes/stats[n]]
Caused by: org.elasticsearch.ElasticsearchSecurityException: missing authentication token for action [cluster:monitor/nodes/stats[n]]
at org.elasticsearch.xpack.security.support.Exceptions.authenticationError(Exceptions.java:39) ~[?:?]
...
[2017-07-07T20:20:37,303][INFO ][o.e.c.s.ClusterService ] [elasticsearch-master-app-atlas-002] removed {{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301},}, reason: zen-disco-node-failed({5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{5.3.0-tribe-001/app}{N903Jaj3TjOwjGkcEqhRJA}{nNMR5v6TTuKOFHo5qLs4IQ}{YY.YYY.YY.YY}{YY.YYY.YY.YY:9301} failed to ping, tried [3] times, each with maximum [30s] timeout]
I have disabled X-Pack on all nodes through docker-compose, as shown below:
version: '2'
services:
  elasticsearch:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: elasticsearch
    environment:
      - action.destructive_requires_name=true
      - bootstrap.memory_lock=true
      - cluster.name=app-atlas
      - cluster.routing.allocation.awareness.attributes=rack_id
      - cluster.routing.allocation.node_initial_primaries_recoveries=40
      - cluster.routing.allocation.node_concurrent_recoveries=40
      - discovery.zen.minimum_master_nodes=1
      - discovery.zen.master_election.ignore_non_master_pings=true
      - discovery.zen.ping.unicast.hosts=XX.XXX.XX.XXX,YY.YYY.YY.YY
      - http.port=9200
      - http.cors.enabled=true
      - indices.recovery.max_bytes_per_sec=400mb
      - indices.fielddata.cache.size=20%
      - indices.store.throttle.type=none
      - node.name=elasticsearch-master-app-atlas-002
      - node.master=true
      - node.data=false
      - node.attr.rack_id=rack_d
      - thread_pool.bulk.queue_size=400
      - thread_pool.bulk.size=40
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.watcher.enabled=false
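To double-check that those flags actually took effect on the running nodes, I look at the node settings with the nodes info API (host is a placeholder again):

```shell
# "settings" is a metric of the nodes info API; host/port are placeholders
curl -s 'http://elasticsearch-master:9200/_nodes/settings?pretty' | grep xpack
```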
This is my tribe node's configuration:
version: '2'
services:
  tribe:
    image: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    container_name: tribe
    environment:
      - node.name=5.3.0-tribe-001
      - cluster.name=tribe-atlas
      - node.master=false
      - node.data=false
      - transport.tcp.port=9300
      - http.port=9200
      - tribe.infra.cluster.name=infra-atlas
      - tribe.infra.discovery.zen.ping.unicast.hosts=CC.CCC.CC.CC,DD.DDD.DD.DDD
      - tribe.app.cluster.name=app-atlas
      - tribe.app.discovery.zen.ping.unicast.hosts=AA.AAA.AA.AA,BB.BBB.BB.BBB
      - xpack.watcher.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - xpack.security.enabled=false
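To see which nodes the tribe node actually knows about, I use the cat nodes API (host is a placeholder for one of my tribe nodes):

```shell
# Lists every node the tribe node currently sees, with role and master columns
curl -s 'http://tribe:9200/_cat/nodes?v'
```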
When I call the /_cluster/health
API on a tribe node, I get the sum of the nodes of both clusters at any given time, as if the tribe node were connected to the clusters, as shown below:
{
  "cluster_name" : "tribe-atlas",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 45,
  "number_of_data_nodes" : 38,
  "active_primary_shards" : 18481,
  "active_shards" : 21484,
  ...
}
As you can see, it shows a total of 45 nodes, as if only one tribe node were connected.
Making things worse, my Kibana, which connects only to the tribe nodes, every now and then shows the message "Courier Fetch: X of Y shards failed."
When I get that warning, I get this from Elasticsearch:
...
node: "PzeetkqoS9O32mkCTXUkiw",
reason: {type: "task_cancelled_exception", reason: "cancelled"},
reason: "cancelled",
type: "task_cancelled_exception",
shard: 1,
successful: 3
...
Are those problems related?
Every container I use is from Elastic; I just added the installation of the S3 repository plugin and uploaded the image to my private registry.
Does anyone have a clue about what's going on?
Unfortunately, I can't upgrade to Kibana 5.5 and use cross-cluster search.