Hello Elasticsearch community,
We're stuck while trying to recover our Elasticsearch instance.
Here's our situation:
- A few months ago we deployed a single-node Elasticsearch instance in a Kubernetes cluster (I know it's really undersized, especially with our current usage, and we plan to add more nodes/resources).
- We're using it for indexing and searching several kinds of logs (mainly syslogs and Jenkins job logs), so the use case is time-series log oriented.
- The logs are collected by Filebeat and forwarded to Logstash, then to Elasticsearch.
- The indexes are daily for each log source, with 5 shards per index, so we have a lot of shards (~7000). I've already read that this is bad practice and that we need to merge indexes and probably switch to weekly or monthly indexes.
- At some point the storage filled up (1 TB), the cluster went red, and we had to stop the Kubernetes pod and resize the persistent volume to 2 TB.
- Since then, Elasticsearch has never managed to reach yellow status. Each time it starts, it begins assigning shards and opens more and more files until it hits the underlying ulimit of 1 million open files; Elasticsearch then reports "too many open files" errors and stops, and the same thing happens again when the pod restarts (the snippet after this list shows roughly how we're watching this).
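For context, this is roughly how I'm watching the shard count and the open file descriptors while the node is still up. It's just a small Python/requests sketch; the unauthenticated localhost:9200 endpoint and the 30 s polling interval are assumptions about our setup, adapt as needed:

```python
import time
import requests

ES = "http://localhost:9200"  # assumption: API reachable here, no authentication

# Rough shard count from the _cat API (one line per shard copy).
shards = requests.get(f"{ES}/_cat/shards?h=index,shard,prirep,state", timeout=10).text
print("total shard copies:", len(shards.splitlines()))

# Poll open file descriptors and cluster health until the node goes away.
while True:
    try:
        stats = requests.get(f"{ES}/_nodes/stats/process", timeout=10).json()
        health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
    except requests.RequestException:
        print("node stopped responding")
        break
    for node in stats["nodes"].values():
        proc = node["process"]
        print(node["name"], proc["open_file_descriptors"], "/", proc["max_file_descriptors"])
    print("status:", health["status"], "unassigned shards:", health["unassigned_shards"])
    time.sleep(30)
```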
Can you please help us figure out how to get the cluster to recover without hitting this limit, so that we can then reorganize the indexes?
I'm trying to close indexes, but the cluster doesn't respond to these requests. I also set "cluster.routing.allocation.enable": "none" to prevent shard assignment, but the number of open files keeps growing anyway; the sketch below shows roughly what I'm doing.
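Here is roughly what I'm trying (again a small Python/requests sketch; the index name is just a placeholder, and I'm assuming the same unauthenticated localhost:9200 endpoint):

```python
import requests

ES = "http://localhost:9200"  # assumption: API reachable here, no authentication

# Disable shard allocation; persistent so it survives the pod restarts.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "none"}},
    timeout=30,
)

# Try to close one of the old daily indexes to release its file handles.
# "syslog-2019.01.01" is a placeholder; in practice these requests hang or time out.
requests.post(f"{ES}/syslog-2019.01.01/_close", timeout=30)
```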
We're really stuck and don't know how to recover. We're also thinking of adding some nodes and starting the cluster again, but I don't know whether that would help or not.
We're using Elasticsearch version 6.6.
Thanks in advance for your help.