Unexpected "too many FDs" error on local ES cluster

(Pradeep Reddy) #1

I have Elasticsearch 6.4.2 running on a Mac, inside a container (Docker Desktop). I have about 240 indices and 1100 shards, with 0 replicas. I probably have 100k docs at most (I can't get an exact count because the cluster is red due to the FD error).

I see errors in the ES logs saying "too many open files", and when I check with lsof | grep elasticsearch | wc -l it reports about 800k+ FDs.

This seems strange, considering that the number of indices isn't large and the data is also small.
I have seen bigger clusters (in terms of indices/doc count) with far fewer FDs, so I'm not sure what's wrong with my local ES cluster.

ES data directory stats:
5200 directories, 33015 files.
The majority of these files are translog files; I am not sure whether this is normal.
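(For reference, translog files on disk end in .tlog, with .ckp checkpoint files alongside them, so a rough count inside the container is something like the two commands below; the data path assumes the official image default.)

find /usr/share/elasticsearch/data -name '*.tlog' | wc -l   # translog generation files
find /usr/share/elasticsearch/data -name '*.ckp' | wc -l    # translog checkpoint files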

The node stats endpoint gives this about FDs:
{
  "nodes": {
    "pSuUtbN-T4e0Ovf03sUS6g": {
      "process": {
        "max_file_descriptors": 655366
      }
    }
  }
}
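(The number of descriptors the process actually has open can also be pulled from the process stats, e.g. something like the request below; filter_path just trims the response.)

GET _nodes/stats/process?filter_path=**.open_file_descriptors,**.max_file_descriptors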

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 47843
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 655366
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

(Christian Dahlqvist) #2

Why do you have so many indices and shards for so little data? It does indeed look strange though. How did you index the data?

(Pradeep Reddy) #3

Hi, this is my local cluster, so some testing led to that many indices. Yes, I am aware of the issues/overhead that come with a high number of shards.
The data is indexed using the bulk API, via elasticsearch-js.
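(For context, each of those requests ends up as a standard _bulk body; a minimal sketch is shown below, with my_test_index as a placeholder index name.)

POST _bulk
{ "index": { "_index": "my_test_index", "_type": "_doc" } }
{ "message": "hello" }
{ "index": { "_index": "my_test_index", "_type": "_doc" } }
{ "message": "world" }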

I am also seeing a similar issue on a VM with 100 indices and 200-300k FDs.

What's strange is that half of those 200k entries belong to translog-related files.
For example, EDfVKTP2Tni0mrm6K0RrTA doesn't have any docs but shows 1700 open FDs:
test_20190422122112_6l7qj_7f6bpg7p2v_data_ocr5v0_launchpad_data_customer_viz_ns_default EDfVKTP2Tni0mrm6K0RrTA 5 1 0 0 1.2kb 1.2kb

(David Turner) #4

I think there is some confusion about what lsof is showing. The output here is 1700 lines long but only shows 10 distinct file descriptors in use:

$ cat fds.txt | awk '{ print $5 }' | sort -u
320r
325u
330r
331u
333u
334r
336r
338r
341u
342u
REG

Each file descriptor is listed multiple times, once for each of the 170 threads.
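As a rough cross-check (assuming a Linux container where the Elasticsearch process is PID 1, as in the official image), /proc shows one entry per open descriptor without the per-thread duplication:

ls /proc/1/fd | wc -l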

(Pradeep Reddy) #5

@DavidTurner Thanks.

I didn't know about the per-thread entries there.

However, I haven't noticed numbers this large in other clusters, even after accounting for the duplicates.
Also, this cluster was completely idle when I took these FD stats, so I wonder why there are so many threads running.

Also, I have since deleted a few indices and upgraded my dev cluster to 6.7.2, and the cluster has started without any issues.
Perhaps, even though the FD limit inside the container is high, the limits on macOS and the HyperKit VM are lower. I will update if I run into the FD error again.

(David Turner) #6

Elasticsearch uses thread pools to avoid the overhead of creating a new thread each time it's needed. It's possible that the JVM does something similar too. Keeping idle threads alive is normally pretty cheap so 170 threads isn't too surprising. You might get more useful information from these APIs:

GET _nodes/hot_threads?ignore_idle_threads=false

GET _nodes/stats/thread_pool
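The JVM stats also report the overall live thread count, e.g. something like:

GET _nodes/stats/jvm?filter_path=**.threads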