I'm new to this forum and would like to share a problem we are currently facing with ELK.
We have been running ELK in our Kubernetes (EKS) cluster for a few months on version 7.2.0. We have 12 nodes in the cluster (3x Elastic, 6x Data, 3x Client) plus 5x Kibana. The main purpose of the cluster is to centralise our logs from different platforms. For example, Kubernetes logs are pushed using the Fluent Bit agent, and legacy application logs are pushed to the cluster using Lambda functions.
We collect on average 200 GB of data per day, and data retention varies by index from 4 to 30 days. We have about 1,200 indices and 4,000 shards holding about 5 TB of data. Each node (Data and Elastic) has 16 GB of memory with 8 GB assigned to the Java heap.
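(For reference, the shard and heap figures above come from the _cat APIs; roughly like the sketch below, where the endpoint is a placeholder for one of our client nodes and Python's requests library is used purely for illustration.)

```python
import requests

ES = "http://localhost:9200"  # placeholder; in practice one of our client nodes

# Shards and disk used per node.
print(requests.get(f"{ES}/_cat/allocation", params={"v": "true"}).text)

# Configured heap and current heap usage per node.
print(requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.max,heap.percent"},
).text)
```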
Now the problem:
We have been having an issue with shards getting unassigned randomly, without any particular pattern. It would make sense during the daily index rollover window, when many new indices are created and it takes some time for the cluster to assign shards to the nodes, but it is also happening outside that window. We have also ruled out Curator as the cause. My question here on the forum: has anyone experienced a similar situation? Is it related to our design - about 650 shards per data node? Any help is really appreciated.
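(When it happens, this is roughly how I list the affected shards and the reason Elasticsearch reports for each; a rough sketch, with the host a placeholder rather than our actual tooling.)

```python
import requests

ES = "http://localhost:9200"  # placeholder; we point this at a client node in practice

# Shards that are not assigned, with the reason Elasticsearch reports for each.
resp = requests.get(
    f"{ES}/_cat/shards",
    params={"h": "index,shard,prirep,state,unassigned.reason"},
)
for line in resp.text.splitlines():
    if "UNASSIGNED" in line:
        print(line)
```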
You probably have too many shards per node.
We normally expect at most 20 shards per GB of heap. With 8 GB, that would mean no more than 160 shards per node.
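Spelled out with the numbers you posted (a rough back-of-the-envelope calculation, nothing more):

```python
# Rule of thumb: at most ~20 shards per GB of JVM heap.
heap_gb = 8
recommended_max = 20 * heap_gb                 # 160 shards per node

# Your cluster, from the figures you gave above.
total_shards = 4000
data_nodes = 6
actual_per_node = total_shards / data_nodes    # ~667 shards per node

print(recommended_max, round(actual_per_node))
```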
Also at this size, I'd probably recommend having dedicated master eligible nodes.
May I suggest you look at the following resources about sizing:
The server logs likely contain a lot of useful information about why this is happening, but without those logs we can only really speculate. Can you share the logs from all nodes from around the time of the incident? Please redact as little as possible, but if you must redact something then please make it clear it has been redacted. Use https://gist.github.com/ since there will be too much information to quote in full here.
Hi @DavidTurner, thanks for your reply. Are there any particular logs I need to filter for this one? The master logs or the data node logs? I can see heaps of logs being generated but am unsure which ones would be helpful.
When it happened recently, I managed to grab the output of _cluster/allocation/explain, as below.
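(I fetched it with something like the request below; the host is a placeholder and Python's requests library is just for illustration. With no request body, the API explains one currently unassigned shard.)

```python
import requests

ES = "http://localhost:9200"  # placeholder for one of our client nodes

# With no request body, this explains the first unassigned shard it finds.
resp = requests.get(f"{ES}/_cluster/allocation/explain")
print(resp.json())
```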
The logs from the elected master only contain about three minutes of data, and those three minutes don't include any nodes joining or leaving the cluster:
It looks like you have a Kibana instance pointed at the master nodes too. I recommend not doing that - it's generating a lot of noise in the logs, and the master nodes should not be used for external requests.
You might also like to disable monitoring for now, since that is also generating a lot of noise in the logs.
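If it is the usual X-Pack collection, it can be switched off dynamically; a minimal sketch below (the Python requests usage and placeholder host are mine, and you can set it back to true once you're done investigating):

```python
import requests

ES = "http://localhost:9200"  # placeholder

# Temporarily disable monitoring collection cluster-wide; revert by setting it
# back to true (or to null to clear the setting) once the investigation is done.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"xpack.monitoring.collection.enabled": False}},
)
print(resp.json())
```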
It's really tricky to get the logs from all nodes when the problem occurs; I will probably try logging them to ELK itself so that I get some persistent logs.
Yes, currently we are pointing Kibana to the master nodes and are not making good use of the client nodes; do you think it's a good idea to point Kibana to them instead?
Regarding the monitoring, are you referring to the Kibana X-Pack monitoring? We have the setup below for it: