Unexpected "too many FDs" error on local ES cluster

I have Elasticsearch 6.4.2 running on a Mac, inside a container (Docker Desktop). I have about 240 indices and 1100 shards, with 0 replicas configured. I probably have 100k docs at most (I can't give an exact count because the cluster is red due to the FD error).

I see errors in the ES logs saying "too many open files". When I check with lsof | grep elasticsearch | wc -l, it reports about 800k+ FDs.

This seems strange, considering that the number of indices isn't large and the data is also small.
I have seen bigger clusters (in terms of indices/doc count) with far fewer FDs, so I am not sure what the issue is with my local ES cluster.

ES data directory stats:
5200 directories, 33015 files.
The majority of these files are translog files; I am not sure whether this is normal or not.

The node stats endpoint gives this for FDs:
{
  "nodes": {
    "pSuUtbN-T4e0Ovf03sUS6g": {
      "process": {
        "max_file_descriptors": 655366
      }
    }
  }
}
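
(For anyone following along: that value comes from the node stats process section. Something along these lines against a local node should return roughly the snippet above; the filter_path is only there to trim the response.)

$ curl -s 'localhost:9200/_nodes/stats/process?filter_path=nodes.*.process.max_file_descriptors&pretty'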

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 47843
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 655366
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Why do you have so many indices and shards for so little data? It does indeed look strange though. How did you index the data?

Hi, this is my local cluster, and some testing is what led to that many indices. Yes, I am aware of the issues/overhead that come with a high number of shards.
The data is indexed using the bulk API, via elasticsearch-js.

I am also seeing a similar issue on a VM with 100 indices and 200-300k FDs.

What's strange is that half of these 200k belong to translog-related files.
For example, EDfVKTP2Tni0mrm6K0RrTA doesn't have any docs, but has 1700 open FDs:
test_20190422122112_6l7qj_7f6bpg7p2v_data_ocr5v0_launchpad_data_customer_viz_ns_default EDfVKTP2Tni0mrm6K0RrTA 5 1 0 0 1.2kb 1.2kb

I think there is some confusion about what lsof is showing. The output here is 1700 lines long but only shows 10 distinct file descriptors in use:

$ cat fds.txt | awk '{ print $5 }' | sort -u
320r
325u
330r
331u
333u
334r
336r
338r
341u
342u
REG

Each file descriptor is listed multiple times, once for each of the 170 threads.
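
If you want an accurate per-process count, counting the entries under /proc/<pid>/fd inside the Linux container sidesteps the per-thread duplication entirely. A rough sketch, assuming the usual Elasticsearch main class is running (the pgrep pattern is just one way to find the pid):

$ ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
$ ls /proc/"$ES_PID"/fd | wc -l

(That only works inside the container; macOS itself has no /proc, and lsof -p <pid> | wc -l is the closest equivalent there.)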

@DavidTurner Thanks.

I didn't know about the per-thread entries there.

But I haven't noticed such a huge number in other clusters, even accounting for those duplicates.
Also, this cluster was completely idle when I took these FD stats, so I wonder why there are so many threads running.

Also, I have deleted a few indices and upgraded my dev cluster to 6.7.2, and the cluster has started without any issues.
Perhaps, even though the FD limit in the container is high, the limits on macOS and the hyperkit VM are low. I will update if I run into the FD error again.

Elasticsearch uses thread pools to avoid the overhead of creating a new thread each time it's needed. It's possible that the JVM does something similar too. Keeping idle threads alive is normally pretty cheap so 170 threads isn't too surprising. You might get more useful information from these APIs:

GET _nodes/hot_threads?ignore_idle_threads=false

GET _nodes/stats/thread_pool
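
Or the same thing via curl, assuming the node is listening on the default HTTP port:

$ curl -s 'localhost:9200/_nodes/hot_threads?ignore_idle_threads=false'
$ curl -s 'localhost:9200/_nodes/stats/thread_pool?pretty'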

Hi,

I have removed the hyperkit/Docker variables now: I have ES running on the Mac directly, installed via brew.
I still see the error.

Error:

No Of Indices: 98 (I had about 2x this number of indices, which I deleted before restarting)
No Of Shards: 473

Host macOS ulimit:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 440000
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2048
virtual memory          (kbytes, -v) unlimited

launchctl limit shows

	cpu         unlimited      unlimited
	filesize    unlimited      unlimited
	data        unlimited      unlimited
	stack       8388608        67104768
	core        0              unlimited
	rss         unlimited      unlimited
	memlock     unlimited      unlimited
	maxproc     2048           2048
	maxfiles    440000         524288

Is it possible that the error is misleading and the actual issue might be with some other limit?

I did some more testing.
sysctl kern.num_files is around 3k+ when I don't have ES running.
It goes up to 13k+ with ES running, and ES throws the errors shared above. I am not sure whether there is a transient state where this number goes even higher; I have observed it a few times and it's always 13k+ as far as I can see.

I decreased the limit to 10k with sudo launchctl limit maxfiles 10000, and I got an error in the ES logs saying it needs to be at least 10240.

So the limits that I am setting definitely seem to take effect. This makes me wonder whether there is a transient state where the count goes very high and then comes back down, or whether the error message is misleading and I am running into some other related limit.
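
A simple polling loop, something like the following, should be enough to see whether the count ever spikes higher than the steady 13k+ I am seeing:

$ while true; do sysctl kern.num_files kern.maxfiles; sleep 1; done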

I don't think so. From what I can see, Too many open files maps to the error code EMFILE ...

... and the docs for this error code don't give any alternative explanations:

EMFILE The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
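
As a contrived illustration (a throwaway bash subshell with a deliberately tiny limit; the exact numbers don't matter), exhausting the per-process limit should reproduce that same message:

$ ( ulimit -n 20; for i in {3..25}; do eval "exec $i<>/dev/null" || break; done )
# prints something like: bash: /dev/null: Too many open files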

Can you share your cluster's settings (GET _cluster/settings) and also your elasticsearch.yml file(s)?

This is a fresh installation that I just did, with the data copied over from the Docker setup where I first ran into this, so there is nothing fancy in my setup.
But I have attached these to the gist.

Btw, the same issue is happening for other members of my team who are using Macs and have roughly the same number of indices as I do.

Sorry, debugging this kind of issue on a Mac is outside what I can really help with. I know there are OS-level tools like dtrace that might help work out whether Elasticsearch really is opening far too many files or whether there's some other limit that we don't know about, but I have no experience with using them.
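
A starting point might be something along these lines (untested on my side, and System Integrity Protection may block dtrace on recent macOS versions); it counts the open-type syscalls the Elasticsearch process actually makes, with <ES_PID> standing in for the real process id:

$ sudo dtrace -n 'syscall::open*:entry /pid == $target/ { @[probefunc] = count(); }' -p <ES_PID>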

Can you reproduce this on a supported OS?

@DavidTurner thanks, that's understandable.
As mentioned in my original post, this definitely seems to be something to do with Mac + ES. I will see whether any tracing tools can help narrow it down.

I was also trying to find a way to simulate this scenario, to see whether my system's kern.num_files can go beyond 13-14k. If it can, that means the OS can handle more FDs, which would suggest that a transient state where ES briefly holds far more FDs is the likely problem.
If you know of any way to simulate this, please let me know.
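
One thing I might try is to hold a large number of descriptors open from a single process and see how far kern.num_files actually climbs. An untested sketch (run in a throwaway shell; it holds ~20k descriptors on /dev/null for a minute so the counter can be checked from another terminal):

$ ( for i in {10..20000}; do eval "exec $i<>/dev/null" || break; done; sysctl kern.num_files; sleep 60 )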
