Client nodes frequently drop out of the cluster

Hi All,

Greetings !!

I am new to ELK; below is my setup.

9 nodes in total (6 data nodes + 2 client nodes + 1 management node)

Every 12-20 minutes both of my client nodes drop out of the cluster and then rejoin automatically.

Logs from a client node:

MasterNotDiscoveredException[NodeDisconnectedException[[BPOConnectDataNode05][10.46.xx.xx:9300][cluster:monitor/health] disconnected]]; nested: NodeDisconnectedException[[BPOConnectDataNode05][10.46.xx.xx:9300][cluster:monitor/health] disconnected];
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:126)
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:98)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retry(TransportMasterNodeAction.java:211)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.access$900(TransportMasterNodeAction.java:110)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.handleException(TransportMasterNodeAction.java:200)
at org.elasticsearch.transport.TransportService$Adapter$3.run(TransportService.java:622)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: NodeDisconnectedException[[BPOConnectDataNode05][10.46.xx.xx:9300][cluster:monitor/health] disconnected]

Please help!

Welcome to the discuss forum and to ELK!

What is a 'management node'? Do you mean a master?

And what version is this? 'Client' nodes were renamed to coordinating nodes quite a while ago.

What are the RAM and heap sizes on each of these nodes?

Is the master stable, and can it always provide good health and status quickly? Nodes dropping from a cluster is often caused either by network issues (are you on cloud, or where?) or by an unstable master, usually one without enough heap.
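
If it helps, here is a minimal sketch of how one might scan a cluster health response for signs that the master is struggling. The field names are the ones the health API returns; the function name, the thresholds and the sample values are made up for illustration and are not recommendations.

```python
def master_warning_signs(health, max_pending=100, max_wait_ms=60_000):
    """Return human-readable warnings for a cluster health response dict.

    max_pending and max_wait_ms are illustrative thresholds (assumptions).
    """
    signs = []
    status = health.get("status")
    if status != "green":
        signs.append(f"status is {status}")
    pending = health.get("number_of_pending_tasks", 0)
    if pending > max_pending:
        signs.append(f"{pending} cluster state tasks are pending on the master")
    wait_ms = health.get("task_max_waiting_in_queue_millis", 0)
    if wait_ms > max_wait_ms:
        signs.append(f"oldest pending task has waited {wait_ms / 60000:.1f} min")
    return signs


# Hypothetical sample response, not taken from a real cluster.
sample = {
    "status": "red",
    "number_of_pending_tasks": 500,
    "task_max_waiting_in_queue_millis": 120000,
}
for sign in master_warning_signs(sample):
    print(sign)
```

A long queue of pending tasks together with a large maximum wait usually means the master cannot apply cluster state updates fast enough, which would also explain join requests timing out.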

Yes, it's a master node.

{
"cluster_name" : "BPOConnectElasticSearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 9,
"number_of_data_nodes" : 6,
"active_primary_shards" : 18232,
"active_shards" : 21453,
"relocating_shards" : 0,
"initializing_shards" : 16, --> it got stuck
"unassigned_shards" : 15095,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 4360,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 4515942,
"active_shards_percent_as_number" : 58.672464719396125
}

Version of ES:
"name" : "BPOConnectManagementNode",
"cluster_name" : "BPOConnectElasticSearch",
"version" : {
"number" : "2.3.4",
"build_hash" : "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
"build_timestamp" : "2016-06-30T11:24:31Z",
"build_snapshot" : false,
"lucene_version" : "5.5.0"
},
"tagline" : "You Know, for Search"

Re: "'client' nodes were renamed to coordinating nodes quite a while ago" --> No

RAM & heap size (output of -XGET '_cluster/stats?human&pretty'):

{
"timestamp" : 1596266903468,
"cluster_name" : "BPOConnectElasticSearch",
"status" : "red",
"indices" : {
"count" : 3068,
"shards" : {
"total" : 21653,
"primaries" : 18232,
"replication" : 0.18763712154453707,
"index" : {
"shards" : {
"min" : 1,
"max" : 12,
"avg" : 7.0576923076923075
},
"primaries" : {
"min" : 1,
"max" : 6,
"avg" : 5.942633637548892
},
"replication" : {
"min" : 0.0,
"max" : 1.0,
"avg" : 0.1864189482833549
}
}
},
"docs" : {
"count" : 62036871,
"deleted" : 14
},
"store" : {
"size" : "26gb",
"size_in_bytes" : 27935301698,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"total_count" : 220641,
"hit_count" : 0,
"miss_count" : 220641,
"cache_size" : 0,
"cache_count" : 0,
"evictions" : 0
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 98415,
"memory" : "841.9mb",
"memory_in_bytes" : 882895054,
"terms_memory" : "737.3mb",
"terms_memory_in_bytes" : 773116690,
"stored_fields_memory" : "37.9mb",
"stored_fields_memory_in_bytes" : 39759160,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "15.2mb",
"norms_memory_in_bytes" : 16017472,
"doc_values_memory" : "51.5mb",
"doc_values_memory_in_bytes" : 54001732,
"index_writer_memory" : "0b",
"index_writer_memory_in_bytes" : 0,
"index_writer_max_memory" : "12.8gb",
"index_writer_max_memory_in_bytes" : 13768130560,
"version_map_memory" : "0b",
"version_map_memory_in_bytes" : 0,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0
},
"percolate" : {
"total" : 0,
"time" : "0s",
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
}
},
"nodes" : {
"count" : {
"total" : 7,
"master_only" : 1,
"data_only" : 1,
"master_data" : 5,
"client" : 0
},
"versions" : [ "2.3.4" ],
"os" : {
"available_processors" : 56,
"allocated_processors" : 56,
"mem" : {
"total" : "125.4gb",
"total_in_bytes" : 134660542464
},
"names" : [ {
"name" : "Linux",
"count" : 7
} ]
},
"process" : {
"cpu" : {
"percent" : 23
},
"open_file_descriptors" : {
"min" : 400,
"max" : 72903,
"avg" : 28032
}
},
"jvm" : {
"max_uptime" : "2h",
"max_uptime_in_millis" : 7531566,
"versions" : [ {
"version" : "1.8.0_121",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.121-b13",
"vm_vendor" : "Oracle Corporation",
"count" : 4
}, {
"version" : "1.8.0_151",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.151-b12",
"vm_vendor" : "Oracle Corporation",
"count" : 3
} ],
"mem" : {
"heap_used" : "44.3gb",
"heap_used_in_bytes" : 47587332584,
"heap_max" : "149.5gb",
"heap_max_in_bytes" : 160573161472
},
"threads" : 621
},
"fs" : {
"total" : "1023.6gb",
"total_in_bytes" : 1099166326784,
"free" : "958.4gb",
"free_in_bytes" : 1029079048192,
"available" : "934.4gb",
"available_in_bytes" : 1003326205952,
"spins" : "true"
},
"plugins" :

The master is always stable. Only the client nodes drop out of the cluster. No, it's not in the cloud; these are ordinary virtual machines hosted in a data center.

Client node log:
[INFO ][cluster.service ] [BPOConnectClientNode01] detected_master {BPOConnectDataNode05}{DwFeVQAXT1e7OvAWzJl2xg}{10.46.XX.XX}{10.46.xx.xx:9300}{master=true}, added {{BPOConnectDataNode02}{lWiurDCdQBeBlKEaqOKB1Q}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true},{BPOConnectDataNode01}{6L30R7wNSk-cF6IonxFQXw}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true},{BPOConnectClientNode02}{WoJeRKDNQYCfJzA3Ht22RA}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=false},{BPOConnectDataNode05}{DwFeVQAXT1e7OvAWzJl2xg}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true},{BPOConnectClientNode01}{QWTeWqHtS_ChB-qTI0huUQ}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=false},{BPOConnectDataNode06}{FKyK7eILRfOqd_0-YT0l6w}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true},{BPOConnectDataNode03}{E-pvqVh7QXms_8A9EmlJjg}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true},{BPOConnectManagementNode}{pqGpyZVQQLaTHBlp_11r_Q}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=true},{BPOConnectClientNode02}{7LiimhjoR7GIwtsB2oF9fg}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=false},{BPOConnectDataNode04}{NZ1m1vLAQMCZnQ_1MYxYNQ}{10.46.xx.xx}{10.46.xx.xx:9300}{master=false},}, reason: zen-disco-receive(from master [{BPOConnectDataNode05}{DwFeVQAXT1e7OvAWzJl2xg}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true}])
[INFO ][cluster.service ] [BPOConnectClientNode01] removed {{BPOConnectClientNode01}{QWTeWqHtS_ChB-qTI0huUQ}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=false},}, reason: zen-disco-receive(from master [{BPOConnectDataNode05}{DwFeVQAXT1e7OvAWzJl2xg}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true}])
[INFO ][cluster.service ] [BPOConnectClientNode01] removed {{BPOConnectClientNode02}{7LiimhjoR7GIwtsB2oF9fg}{10.46.xx.xx}{10.46.xx.xx:9300}{data=false, master=false},}, reason: zen-disco-receive(from master [{BPOConnectDataNode05}{DwFeVQAXT1e7OvAWzJl2xg}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true}])

Note: We have 6 data nodes with 35000 shards, 1 master node, and 5 master-eligible nodes.
After a restart of the cluster, "initializing_shards" got stuck and it's not doing anything.
Please help!

[2020-08-01 10:29:47,247][INFO ][discovery.zen ] [BPOConnectClientNode01] failed to send join request to master [{BPOConnectDataNode05}{5OAzBTQqRfOzBQM9FjB7Qw}{10.46.xx.xx}{10.46.xx.xx:9300}{master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]

Note: BPOConnectDataNode05 is the master node.

You have far too many shards for a cluster of that size. Please read this blog post and look to reduce the shard count significantly. I would also recommend that you upgrade, as you are using a very old version that has been EOL for quite a while.
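
To put rough numbers on that, here is a small sketch using the figures from the cluster stats output posted above (21653 shards, 27935301698 bytes of store, 6 data nodes). The helper name is made up for illustration.

```python
def shard_summary(total_shards, store_bytes, data_nodes):
    """Return (shards per data node, average shard size in MiB)."""
    shards_per_node = total_shards / data_nodes
    avg_shard_mib = store_bytes / total_shards / (1024 ** 2)
    return shards_per_node, avg_shard_mib


# Values copied from the cluster stats output in this thread.
per_node, avg_mib = shard_summary(
    total_shards=21653,
    store_bytes=27935301698,
    data_nodes=6,
)
print(f"{per_node:.0f} shards per data node, {avg_mib:.2f} MiB per shard on average")
```

That works out to several thousand shards per data node, each holding only about a megabyte of data, so almost all of the cost is per-shard overhead rather than the data itself.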

Hi Christian,

Thanks for the reply. I don't want to lose any logs. If I delete some of the shards, what will happen to the logs and replicas? The upgrade is in the pipeline; I am working on that.

Need your guidance

Current status:

"cluster_name" : "BPOConnectElasticSearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 7, --> it should be 9
"number_of_data_nodes" : 6,
"active_primary_shards" : 18232,
"active_shards" : 30978,
"relocating_shards" : 0,
"initializing_shards" : 12, --> got stuck
"unassigned_shards" : 5574,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 62244,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 19605593,
"active_shards_percent_as_number" : 84.72267804397768

heap.percent ram.percent load node.role master name
          33          46 0.24 -         m      BPOConnectManagementNode
           9          59 0.35 d         m      BPOConnectDataNode06
          59          60 0.50 d         m      BPOConnectDataNode03
          44          62 0.44 d         m      BPOConnectDataNode02
          13          61 0.47 d         -      BPOConnectDataNode04
           5          57 0.15 d         m      BPOConnectDataNode01
          79          58 2.50 d         *      BPOConnectDataNode05

Both client nodes (1 and 2) are out of the cluster.

[INFO ][discovery.zen ] [BPOConnectClientNode01] failed to send join request to master [{BPOConnectDataNode05}{5OAzBTQqRfOzBQM9FjB7Qw}{10.46.112.55}{10.46.XX.XX:9300}{master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]

Please help.

I suggest you close 90% of your indices and then reindex in blocks to reduce the shard count. The version is so old that I'm not sure what works (can you even close indices in version 2?).

Also, all your nodes except DataNode04 are master-eligible, so I assume they are wasting heap on cluster state. Maybe you can remove that role from a few. Ideally you can also add RAM to at least the elected master, to get the system stable enough to reindex and reduce the shard count by 90%.

I just noticed that it looks like you only have 26GB of data in the cluster. Is that correct? That is what I would expect the size of a single shard to be. You should therefore be able to hold all the data in a single index with a few primary shards. If you changed it this way you may also be able to reduce the hardware footprint.
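
As a rough sizing sketch: the target shard sizes below (10 GB and 30 GB) are assumptions picked for illustration, not figures from your cluster, and the function name is made up. As far as I know the store size in cluster stats includes replica copies, so this slightly overestimates what the primaries alone need.

```python
import math


def primaries_needed(store_bytes, target_shard_gb):
    """Smallest number of primary shards keeping each at or below the target size."""
    target_bytes = target_shard_gb * 1024 ** 3
    return max(1, math.ceil(store_bytes / target_bytes))


# size_in_bytes from the cluster stats output in this thread.
store_bytes = 27935301698

for target_gb in (10, 30):  # hypothetical target shard sizes
    count = primaries_needed(store_bytes, target_gb)
    print(f"target {target_gb} GB per shard -> {count} primary shard(s)")
```

Either way the answer is a handful of primaries at most, compared with the more than 18000 you have today.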

Yes, I did this last night, and the current status is below.

{
"cluster_name" : "BPOConnectElasticSearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 9,
"number_of_data_nodes" : 6,
"active_primary_shards" : 18264,
"active_shards" : 36528,
"relocating_shards" : 2,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 1179,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 1073095,
"active_shards_percent_as_number" : 100.0
}

Now the problem is the Logstash shipper: it's not sending logs to ES.

redis-cli LLEN sourceLogstash
(integer) 218402

tail -f logstash_indexer.log-20200802
{:timestamp=>"2020-08-01T18:58:51.111000+0200", :message=>"Attempted to send a bulk request to Elasticsearch configured at '["http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection refused (Connection refused)", :class=>"Manticore::SocketException", :client_config=>{:hosts=>["http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/", "http://10.46.xx.xx:9200/"], :ssl=>nil, :transport_options=>{:socket_timeout=>0, :request_timeout=>0, :proxy=>nil, :ssl=>{}}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::Manticore, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false, :http=>{:scheme=>"http", :user=>nil, :password=>nil, :port=>9200}}, :level=>:error}

I don't know what to do now :(

Yeah, you are 100% correct. I recently joined this team and I am very new to ELK.
I plan to migrate from RHEL 6 to RHEL 7, and at that time I will configure it as you said; it really makes sense.

For the past week I have been struggling to get this up. Now the ES cluster and the Kibana service are up, but Logstash is giving trouble :(

Given the number of nodes you have, you may be able to shrink the current cluster so that you can set up a new 3-node Elasticsearch 7.8 cluster and reindex the data into a single index on it using reindex from remote.
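
A minimal sketch of what the reindex-from-remote request body looks like, built in Python so it can be inspected before sending. The host name, index pattern and destination index are placeholders I made up. The body would be sent as POST _reindex to the new cluster, and the old host has to be listed under reindex.remote.whitelist in the new cluster's elasticsearch.yml.

```python
import json


def remote_reindex_body(remote_host, source_index, dest_index):
    """Build the JSON body for a reindex-from-remote request."""
    return {
        "source": {
            "remote": {"host": remote_host},  # the old cluster's HTTP address
            "index": source_index,            # index name or pattern on the old cluster
        },
        "dest": {"index": dest_index},        # single index on the new cluster
    }


# All three values below are hypothetical placeholders.
body = remote_reindex_body(
    remote_host="http://old-cluster.example:9200",
    source_index="logstash-*",
    dest_index="logs-consolidated",
)
print(json.dumps(body, indent=2))
```

Pulling the old indices in a few patterns at a time, rather than everything in one request, keeps each run short and easy to retry.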

Sure, I will take this and try to implement it.

One way to prevent it from getting any worse is to start indexing into a single index rather than the large number of exceptionally small ones you have now. Unless you update documents, this could be a quick win. You can then reindex old indices into this single index using the reindex API and remove the small indices once they are reindexed. This should reduce the index count over time and stop new indices being generated (which would make the situation worse).

As creating indices requires the cluster state to be updated and propagated across the cluster (which is slow due to your immense number of shards), this should also make ingestion more reliable.
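
The consolidation could be planned along these lines. This is only a sketch: it builds the reindex request bodies in batches and lists which small indices become candidates for deletion after each batch, without talking to the cluster. The index names, the destination name and the batch size are all made up, and you would want to compare document counts before deleting anything.

```python
def batches(names, size):
    """Yield consecutive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def reindex_plan(old_indices, dest_index, batch_size):
    """Return one step per batch: a _reindex body plus the indices to delete afterwards."""
    plan = []
    for batch in batches(sorted(old_indices), batch_size):
        plan.append({
            "reindex": {
                "source": {"index": batch},     # several small indices at once
                "dest": {"index": dest_index},  # the single consolidated index
            },
            "delete_after": batch,              # only after verifying doc counts
        })
    return plan


# Hypothetical index names for illustration.
old = ["app-2020.07.03", "app-2020.07.01", "app-2020.07.02"]
for step in reindex_plan(old, dest_index="app-consolidated", batch_size=2):
    print(step)
```

Working in small batches keeps each reindex short, so a failure only means repeating one batch, and every deleted index shrinks the cluster state a little.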
