Stuck with a "too many open files" issue

Hello Elasticsearch community,

We're stuck while trying to recover our Elasticsearch instance.
Here's our situation:

  • A few months ago we deployed a single-node Elasticsearch instance in a Kubernetes cluster (I know it's really undersized, especially with our current usage), and we plan to add more nodes/resources.
  • We're using it to index and search several kinds of logs (mainly syslogs and Jenkins jobs), so the use case is time-series log oriented.
  • The logs are collected by Filebeat and forwarded to Logstash, then to Elasticsearch.
  • We create a daily index for each log source, with 5 shards per index, so we have a lot of shards (~7,000). I've already read that this is not good and that we need to merge indices and probably switch to weekly or monthly indices.
  • At some point the storage filled up (1 TB), the cluster went red, and we had to stop the k8s pod and resize the persistent volume to 2 TB.
  • Since then, Elasticsearch can't get back to yellow state. Each time it starts, it begins assigning shards and opening more and more files until it hits the underlying ulimit of 1 million open files, reports "too many open files" errors and stops, and the same thing happens again when the pod restarts.

Can you please help us figure out how to make the cluster recover without hitting this limit, so that we can then reorganize the indices?
I've tried closing indices, but the cluster doesn't answer those requests.
I also set "cluster.routing.allocation.enable": "none" to prevent shard assignment, but the number of open files still keeps growing.
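For reference, here's roughly what we're running against the cluster, sketched with the Python client (the endpoint and index pattern are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # Disable shard allocation so no more shards get assigned.
    es.cluster.put_settings(body={
        "transient": {"cluster.routing.allocation.enable": "none"}
    })

    # Try to close old daily indices (these requests time out for us).
    es.indices.close(index="syslog-2019.01.*")  # placeholder index pattern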

We're really stuck and don't know how to recover.
We're thinking of adding some nodes and starting again, but I don't know whether that would help.

We're using Elasticsearch version 6.6.

Thanks in advance for your help.

You should close as many indices as you can as soon as you start Elasticsearch up, which will stop Elasticsearch from opening all these files at once.
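For example, a minimal sketch with the Python client, assuming daily indices named along these lines (adjust the pattern to your own naming):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # adjust to your cluster address

    # Close the older daily indices right after startup so Elasticsearch
    # never opens their shards (and all the files behind them).
    es.indices.close(index="logstash-2019.*",
                     expand_wildcards="open",
                     ignore_unavailable=True)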

Then you need to work out how to have vastly fewer shards. Aim for fewer than 20 per GB of heap (so at most 600 per node, assuming the usual ~30 GB heap ceiling), but really with only 1 TB of data you should be able to use far fewer than that. You may need to reindex your data to achieve that.
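To put numbers on that rule of thumb, a small sketch that compares the current shard count with the ~20-shards-per-GB-of-heap budget (the heap size here is only an example):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    heap_gb = 30                   # example: the usual ~30 GB heap ceiling
    shard_budget = 20 * heap_gb    # ~20 shards per GB of heap -> 600 shards

    # _cat/shards returns one entry per shard copy in the cluster.
    shards = es.cat.shards(format="json")
    print(f"current shards: {len(shards)}, budget: {shard_budget}")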

Once you have a plan, you can start to open a few indices at a time, take a snapshot in case something goes wrong, process them into fewer shards, and then delete them. Repeat until your cluster is back to health again.
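One possible shape for that loop, sketched with the Python client; the index names, the batch, and the snapshot repository "backups" are all hypothetical:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    daily = ["syslog-2019.01.01", "syslog-2019.01.02"]  # one small batch of dailies
    monthly = "syslog-2019.01"                          # consolidated 1-shard target

    # Open just this batch so only a few shards' files are held open.
    es.indices.open(index=",".join(daily))

    # Snapshot first in case something goes wrong (the repository must already exist).
    es.snapshot.create(repository="backups",
                       snapshot="syslog-2019.01-batch1",
                       body={"indices": ",".join(daily)},
                       wait_for_completion=True)

    # Create the consolidated index with a single shard, then reindex into it.
    es.indices.create(index=monthly,
                      body={"settings": {"number_of_shards": 1,
                                         "number_of_replicas": 0}},
                      ignore=400)  # ignore "already exists" on later batches

    es.reindex(body={"source": {"index": daily}, "dest": {"index": monthly}},
               wait_for_completion=True, request_timeout=3600)

    # Finally delete the dailies to free their shards and disk space.
    es.indices.delete(index=",".join(daily))

Note that _shrink only reduces the shard count of a single index; to consolidate many daily indices into one weekly or monthly index you need a reindex like the one above anyway.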

Hello David,

Thanks a lot, your suggestion worked like a charm.
Even though there's still a lot of work to do to merge the indices, it allowed us to recover and get the stack running again.

Wish you a great day ahead.
Youssef.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.