I have Elasticsearch 6.4.2 running on a Mac, inside a container (Docker Desktop). I have about 240 indices and 1100 shards, with 0 replicas. I probably have 100k docs at most (I can't give an exact count because the cluster is red due to the FD errors).
I see "too many open files" errors in the ES logs. When I run lsof | grep elasticsearch | wc -l, it reports 800k+ FDs.
This seems strange, considering that the number of indices isn't large and the data is small.
I have seen bigger clusters (in terms of index/doc count) with far fewer FDs, so I am not sure what's the issue with my local ES cluster.
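A note on the lsof figure: lsof prints more than one line per descriptor. It also lists entries such as cwd, txt and mem that are not descriptors at all, and some versions repeat every entry once per thread, so lsof | wc -l can overstate the real count considerably. A minimal sketch for a cross-check, assuming it runs inside the (Linux) container where /proc is available; count_open_fds and the proc_root parameter are names made up for this example:

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in <proc_root>/<pid>/fd.

    On Linux each entry in that directory is exactly one open file
    descriptor of the process, with no per-thread repetition.
    proc_root is a parameter only so the function can be pointed at
    a fake directory tree for testing (an assumption of this sketch).
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))
```

Calling count_open_fds with the PID of the Elasticsearch java process (run as the same user or as root) gives a number that can be compared directly with the open files limit.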
ES Data directory stats
5200 directories, 33015 files.
The majority of these files are translog files; I am not sure whether this is normal.
The node stats endpoint gives this about FDs
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 47843
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 655366
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
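For comparison with the shell limits above, the node stats API (GET _nodes/stats/process) reports open_file_descriptors and max_file_descriptors for each node under nodes.<id>.process. A rough sketch of pulling those two numbers out of a response, assuming the node is reachable at a base URL you pass in; fetch_node_stats and fd_usage are hypothetical helper names:

```python
import json
import urllib.request


def fetch_node_stats(base_url):
    """Fetch the process section of node stats.

    base_url (for example http://localhost:9200) is an assumption;
    adjust it to wherever the node's HTTP port is published.
    """
    with urllib.request.urlopen(base_url + "/_nodes/stats/process") as resp:
        return json.load(resp)


def fd_usage(node_stats):
    """Return {node_name: (open_fds, max_fds)} from a node stats body."""
    usage = {}
    for node in node_stats.get("nodes", {}).values():
        proc = node.get("process", {})
        usage[node.get("name")] = (
            proc.get("open_file_descriptors"),
            proc.get("max_file_descriptors"),
        )
    return usage
```

If open_file_descriptors is far below the lsof count, the lsof number is probably inflated rather than the node really holding that many descriptors.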
Hi, this is my local cluster, and some testing led to that many indices. Yes, I am aware of the issues/overhead with a high number of shards.
Data is indexed using the bulk API, via elasticsearch-js.
I am also seeing a similar issue on a VM with 100 indices and 200-300k FDs.
What's strange is that half of these 200k belong to translog-related files.
For example, EDfVKTP2Tni0mrm6K0RrTA doesn't have any docs, but has 1700 open FDs:
test_20190422122112_6l7qj_7f6bpg7p2v_data_ocr5v0_launchpad_data_customer_viz_ns_default EDfVKTP2Tni0mrm6K0RrTA 5 1 0 0 1.2kb 1.2kb
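To see which indices hold the descriptors, the open file paths can be grouped by index UUID. The sketch below assumes the usual on-disk layout of .../indices/<uuid>/<shard>/<subdir>/<file> (with subdir being index, translog or _state) and takes a plain list of paths, for example the NAME column from lsof; fds_by_index is a name invented for this example:

```python
from collections import Counter


def fds_by_index(paths):
    """Count paths per (index_uuid, subdir).

    Only paths shaped like .../indices/<uuid>/<shard>/<subdir>/<file>
    are counted; anything else (sockets, jars, /dev/null, index-level
    state files) is ignored. The layout is an assumption of this sketch.
    """
    counts = Counter()
    for path in paths:
        parts = path.strip().split("/")
        if "indices" not in parts:
            continue
        i = parts.index("indices")
        # Need at least uuid, shard, subdir and a file name after "indices".
        if len(parts) < i + 5:
            continue
        uuid, subdir = parts[i + 1], parts[i + 3]
        counts[(uuid, subdir)] += 1
    return counts
```

Sorting the result with counts.most_common() shows quickly whether the translog entries dominate, and the same path appearing many times points to duplicate lines in the listing rather than extra descriptors.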
But even accounting for duplicates, I haven't seen such a huge number in other clusters.
Also, this cluster was completely idle when I took these FD stats, so I wonder why there are so many threads running.
Also, I have deleted a few indices and updated my dev cluster to 6.7.2, and the cluster has started without any issues.
Perhaps, even though the FD limit in the container is high, the limits on macOS and the HyperKit VM are low. I will update if I run into the FD error again.
Elasticsearch uses thread pools to avoid the overhead of creating a new thread each time it's needed. It's possible that the JVM does something similar too. Keeping idle threads alive is normally pretty cheap so 170 threads isn't too surprising. You might get more useful information from these APIs:
I did some more testing.
sysctl kern.num_files reports around 3k+ when ES is not running.
It goes up to 13k+ with ES running, and ES throws the errors shared above. I am not sure if there is a transient state where this number goes higher; I checked a few times and it was always 13k+ as far as I can see.
I decreased the limit to 10k with sudo launchctl limit maxfiles 10000, and got an error in the ES logs that it needs to be at least 10240.
So the limits I am setting definitely seem to take effect. This makes me wonder whether there is a transient state where the count goes very high and then comes back down, or whether the error message is misleading and I am running into some other related limit.
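Checking kern.num_files by hand a few times can easily miss a short spike. A small sketch that polls the value and keeps the peak, assuming it runs on the macOS host while ES starts up; read_num_files and watch_peak are hypothetical names, and the sample count and interval are arbitrary defaults:

```python
import subprocess
import time


def read_num_files():
    """Read the system-wide open file count via sysctl (macOS host)."""
    out = subprocess.run(
        ["sysctl", "-n", "kern.num_files"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def watch_peak(read=read_num_files, samples=600, interval=0.1, sleep=time.sleep):
    """Poll `read` `samples` times and return the highest value seen.

    `read` and `sleep` are injectable so the loop can be exercised
    without sysctl; that is a testing convenience of this sketch.
    """
    peak = None
    for _ in range(samples):
        value = read()
        if peak is None or value > peak:
            peak = value
        sleep(interval)
    return peak
```

With the defaults this watches for roughly a minute. A peak well above 13k would support the transient-state theory; a peak that stays near 13k would point more toward some other limit.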
Sorry, debugging this kind of issue on a Mac is outside what I can really help with. I know there are OS-level tools like dtrace that might help work out whether Elasticsearch really is opening far too many files or whether there's some other limit that we don't know about, but I have no experience with using them.
@DavidTurner thanks, that's understandable.
As mentioned in my original post, this is definitely something to do with the Mac + ES combination. I will see if any tracing tools can help narrow it down.
I was also trying to find a way to simulate this scenario to see whether my system's kern.num_files can go beyond 13-14k. If it can, the OS can handle more FDs, which would suggest that ES hits a transient state where it holds many more FDs.
If you know of any way to simulate this, please let me know.
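One possible way to simulate it, offered as a sketch rather than a tested recipe for this setup: open descriptors in a loop until the OS refuses, hold them for a while, and watch kern.num_files from another terminal in the meantime. open_many and hold_seconds are names invented for this example:

```python
import errno
import os
import time


def open_many(target, hold_seconds=0):
    """Try to hold `target` descriptors open at the same time.

    Returns how many were actually opened before a per-process
    (EMFILE) or system-wide (ENFILE) limit was hit. All descriptors
    are closed again before returning.
    """
    fds = []
    try:
        for _ in range(target):
            try:
                fds.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as exc:
                if exc.errno in (errno.EMFILE, errno.ENFILE):
                    break  # hit a limit; report how far we got
                raise
        time.sleep(hold_seconds)  # keep them open while you inspect counters
        return len(fds)
    finally:
        for fd in fds:
            os.close(fd)
```

A single process normally stops at its own ulimit -n first, so pushing the system-wide counter past 13-14k likely needs a raised per-process limit or several copies of the script running at once.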