We are moving from a very old version of ES (1.4.5) to a more modern version. For most of the effort we targeted ES 5.5.0 when rewriting the settings and mappings for our 4 indices. About a month ago we decided to move to the latest 6.x release, which was v6.2.4. After some minor updates to the script that populates the index from MySQL and to the code our application uses to call the index, everything looked fine in testing.
Last week we began loading our entire dataset into ES 6.2.4, and for 3 of the 4 indices this went without any issues. However, for the last index we ran into odd behavior: as we indexed documents we saw a proportional increase in open file descriptors, which quickly (within 30 minutes of starting) ran through our 65536 file descriptor limit and caused the cluster to go red with a "too many open files" error. If I stopped the index script before exhausting the file descriptor limit, the open file descriptor count remained unchanged.
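For anyone who wants to watch the same numbers, here is a minimal sketch of how the per-node descriptor counts can be pulled out of a /_nodes/stats/process response. The function name fd_usage, the node names, and the sample values are made up for illustration; only the nodes / process / open_file_descriptors / max_file_descriptors fields come from the API response.

```python
import json


def fd_usage(node_stats):
    """Return (node_name, open_fds, max_fds, percent_used) per node, busiest first."""
    rows = []
    for node in node_stats.get("nodes", {}).values():
        proc = node.get("process", {})
        open_fds = proc.get("open_file_descriptors")
        max_fds = proc.get("max_file_descriptors")
        # Skip nodes that do not report usable values (e.g. -1 on some platforms).
        if open_fds is None or max_fds is None or max_fds <= 0:
            continue
        percent = round(100.0 * open_fds / max_fds, 1)
        rows.append((node.get("name", "?"), open_fds, max_fds, percent))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical, trimmed-down response body from GET /_nodes/stats/process.
SAMPLE = json.loads("""
{
  "nodes": {
    "node-id-1": {"name": "es-data-1",
                  "process": {"open_file_descriptors": 410,
                              "max_file_descriptors": 65536}},
    "node-id-2": {"name": "es-data-2",
                  "process": {"open_file_descriptors": 52429,
                              "max_file_descriptors": 65536}}
  }
}
""")

for name, open_fds, max_fds, percent in fd_usage(SAMPLE):
    print(f"{name}: {open_fds}/{max_fds} open file descriptors ({percent}%)")
```

In practice the dictionary would come from the live endpoint rather than a hard-coded sample, and polling it every minute or so while the index script runs shows the growth described above.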
Our setup runs each Elasticsearch cluster member (data with shared master) in Docker on a dedicated host launched via Terraform and Ansible. For v6.2.4 we used the docker.elastic.co/elasticsearch/elasticsearch:6.2.4 image and installed the repository-s3 and discovery-ec2 plugins. Since we use Terraform, I tried resizing the instances (m4.large -> c4.xlarge -> m4.2xlarge), destroying and bringing up a new cluster each time. This made no difference to the number of file descriptors in use at the same point in the indexing script. I also varied the number of cluster members (3 -> 5) and saw a proportional drop per member, as the total number of file descriptors was spread across 5 nodes instead of 3.
I came across this: https://github.com/elastic/elasticsearch/pull/29125. This PR is associated with an issue that describes the problem: the open file descriptors belong to tlog files, and they accumulate because of an endless flushing loop. I was able to confirm via lsof -p <es process> that the open file descriptors are associated with tlog files, but I am not sure how to confirm an endless flushing loop via logs or an API call. I did try the elasticsearch-py flush argument wait_if_ongoing, which should make the flush call block while another flush is in progress; however, when I ran it against all indices it returned immediately.
Since my issue seemed at least close to PR 29125, and I saw on Wednesday (June 13) that it was merged into v6.3.0, I decided to give that a try using docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0. Unfortunately, I saw the same issue present itself. I did notice, however, that after approximately 12 hours the number of open file descriptors returned to the level I saw right after terraforming a new cluster, before I set up and populated the index. I tested this 2 separate times, destroying the cluster in between, and tracked open_file_descriptors and max_file_descriptors from the /_nodes/stats/process API endpoint for each member over the 12-hour period. Both times the count dropped almost to the minute once 12 hours had elapsed.
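The lsof check mentioned above can be summarized with a small script. This is only a sketch of the idea, not the exact command I ran: the function name count_open_files_by_suffix and the sample lsof lines (paths, index id abc123, descriptor numbers) are invented for illustration, and it assumes the file path is the last column of each lsof line.

```python
import os
from collections import Counter


def count_open_files_by_suffix(lsof_output):
    """Count open files per file extension from `lsof -p <pid>` text output."""
    counts = Counter()
    for line in lsof_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the lsof header row.
        if not fields or fields[0] == "COMMAND":
            continue
        name = fields[-1]
        # lsof may append "(deleted)" after the path of an unlinked file.
        if name == "(deleted)" and len(fields) > 1:
            name = fields[-2]
        suffix = os.path.splitext(name)[1] or "(none)"
        counts[suffix] += 1
    return counts


# Hypothetical lsof excerpt; real output has many more rows.
SAMPLE_LSOF = """
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 1 elasticsearch 210r REG 259,1 55 1001 /usr/share/elasticsearch/data/nodes/0/indices/abc123/0/translog/translog-41.tlog
java 1 elasticsearch 211r REG 259,1 55 1002 /usr/share/elasticsearch/data/nodes/0/indices/abc123/0/translog/translog-42.tlog
java 1 elasticsearch 212r REG 259,1 55 1003 /usr/share/elasticsearch/data/nodes/0/indices/abc123/0/translog/translog-43.tlog
java 1 elasticsearch 213r REG 259,1 4096 1004 /usr/share/elasticsearch/data/nodes/0/indices/abc123/0/index/_0.cfs
"""

for suffix, count in count_open_files_by_suffix(SAMPLE_LSOF).most_common():
    print(f"{suffix}: {count}")
```

On an affected node, a .tlog count that dominates the total and keeps growing during indexing matches what I saw.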
We do nightly full re-indexes of 40+ million documents across all our indices. To accommodate the memory usage of all the file descriptors needed to index the full dataset, we would have to significantly oversize our cluster, either in instance size or in node count, to work around this issue. To try to rule out a config or version issue, I deployed Amazon's hosted Elasticsearch v6.2 (which is v6.2.3) and saw the same behavior with open file descriptors. However, when I terraformed ES v5.6.10 I saw a steady count of 400-410 open file descriptors across all cluster members at the point where I typically saw the increase on the v6.x cluster.
We would like to get this working on the latest v6.x release, but because of the open file descriptor usage on this version we are not able to do so without over-provisioning. I would appreciate it if someone could provide some suggestions or things to investigate to see if we can get v6.2.4 or 6.3.0 working.