I'm new to this forum and would like to share a problem we are currently facing with ELK.
We have been running ELK in our Kubernetes (EKS) cluster for a few months on version 7.2.0. We have 12 nodes in the cluster (3x Elastic, 6x Data, 3x Client) plus 5x Kibana. The main purpose of the cluster is to centralise our logs from different platforms. For example, Kubernetes logs are pushed using the Fluent Bit agent, and legacy application logs are pushed to the cluster using Lambda functions.
We collect on average 200 GB of data per day, and data retention varies by index from 4 to 30 days. We have about 1,200 indices and 4,000 shards holding about 5 TB of data. Each node (Data and Elastic) has 16 GB of memory with 8 GB assigned to the Java heap.
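(For reference, the shard and heap figures above come from the _cat APIs; roughly like the sketch below, where the endpoint is a placeholder for one of our client nodes and Python's requests library is used purely for illustration.)

```python
import requests

ES = "http://localhost:9200"  # placeholder; in practice one of our client nodes

# Shards and disk used per node.
print(requests.get(f"{ES}/_cat/allocation", params={"v": "true"}).text)

# Configured heap and current heap usage per node.
print(requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.max,heap.percent"},
).text)
```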
Now the problem:
We have been having an issue with shards getting unassigned randomly, without any particular pattern. It would make sense during the daily index rollover window, when many new indices are created and it takes some time for the cluster to assign shards to the nodes, but it is also happening outside that window. We have also ruled out Curator as the cause. My question here on the forum: has anyone experienced a similar situation? Is it related to our design - about 650 shards per data node? Any help is really appreciated.
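(When it happens, this is roughly how I list the affected shards and the reason Elasticsearch reports for each; a rough sketch, with the host a placeholder rather than our actual tooling.)

```python
import requests

ES = "http://localhost:9200"  # placeholder; we point this at a client node in practice

# Shards that are not assigned, with the reason Elasticsearch reports for each.
resp = requests.get(
    f"{ES}/_cat/shards",
    params={"h": "index,shard,prirep,state,unassigned.reason"},
)
for line in resp.text.splitlines():
    if "UNASSIGNED" in line:
        print(line)
```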
You probably have too many shards per node.
We normally expect at most 20 shards per GB of heap. With 8 GB, that would mean no more than 160 shards per node.
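Spelled out with the numbers you posted (a rough back-of-the-envelope calculation, nothing more):

```python
# Rule of thumb: at most ~20 shards per GB of JVM heap.
heap_gb = 8
recommended_max = 20 * heap_gb                 # 160 shards per node

# Your cluster, from the figures you gave above.
total_shards = 4000
data_nodes = 6
actual_per_node = total_shards / data_nodes    # ~667 shards per node

print(recommended_max, round(actual_per_node))
```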
Also at this size, I'd probably recommend having dedicated master eligible nodes.
May I suggest you look at the following resources about sizing:
The server logs likely contain a lot of useful information about why this is happening, but without those logs we can only really speculate. Can you share the logs from all nodes from around the time of the incident? Please redact as little as possible, but if you must redact something then please make it clear it has been redacted. Use https://gist.github.com/ since there will be too much information to quote in full here.
Hi @DavidTurner, thanks for your reply. Are there any particular logs I need to filter for this one? The master logs or the data node logs? I can see heaps of logs being generated but am unsure which ones would be helpful.
When it happened recently, I managed to grab the output of _cluster/allocation/explain, as below.
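(I fetched it with something like the request below; the host is a placeholder and Python's requests library is just for illustration. With no request body, the API explains one currently unassigned shard.)

```python
import requests

ES = "http://localhost:9200"  # placeholder for one of our client nodes

# With no request body, this explains the first unassigned shard it finds.
resp = requests.get(f"{ES}/_cluster/allocation/explain")
print(resp.json())
```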
The logs from the elected master only contain about three minutes of data, and those three minutes don't include any nodes joining or leaving the cluster:
It looks like you have a Kibana instance pointed at the master nodes too. I recommend not doing that - it's generating a lot of noise in the logs, and the master nodes should not be used for external requests.
You might also like to disable monitoring for now, since that is also generating a lot of noise in the logs.
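If it is the usual X-Pack collection, it can be switched off dynamically; a minimal sketch below (the Python requests usage and placeholder host are mine, and you can set it back to true once you're done investigating):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Temporarily disable monitoring collection cluster-wide; revert by setting it
# back to true (or to null to clear the setting) once the investigation is done.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"xpack.monitoring.collection.enabled": False}},
)
print(resp.json())
```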
It's really tricky to get the logs from all nodes when the problem occurs; I will probably try logging them to ELK itself so that I get some persistent logs.
Yes, currently we are pointing Kibana to the master nodes and are not making good use of the client nodes; do you think it's a good idea to point Kibana to them instead?
Regarding the monitoring, are you referring to the Kibana X-Pack monitoring? We have the setup below for it: