Hello Michael, we are facing similar issues. Were you able to solve this problem in the meantime?
I looked around a lot and couldn't find anything conclusive. The most promising lead is a case suggesting a Linux kernel bug. See the GitHub issue.
What OS are you using?
We are using old Linux machines, kernel 3.11.10. I'm thinking of trying a kernel upgrade next (if I get permission to do so)...
Good news, I fixed the issue for good! I want to share what I found. I am in NYC, and a few weeks ago Elastic{ON} came to town and brought a number of support engineers offering free help. I came prepared with a 30 GB Java memory dump taken at the time of the problem. I had spent hours staring at that dump and could not make sense of it; the ES engineer took about 30 seconds to find the problem.
The problem had to do with paging and Googlebot. We have millions of documents in our database and, accordingly, millions of pages of results. Most of our users rarely look past page 3 or 4, but Googlebot would routinely crawl hundreds of thousands of pages deep. I don't fully understand the underlying mechanism, but apparently that was the source of the problem: when you page to, say, page 100,000, information about all of the preceding pages (1 through 99,999) has to be held in memory. That can eat up your heap fast.
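To make that concrete, here is a rough sketch (not our actual code) of the deep from/size paging pattern Googlebot was effectively triggering. The client setup and index name are placeholders:

```python
# Illustrative sketch of deep from/size paging in Elasticsearch.
# Cluster address and index name are placeholders, not from this thread.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
PAGE_SIZE = 10

def fetch_page(page_number):
    # To serve page N, every shard has to collect and rank roughly
    # N * PAGE_SIZE hits, and the coordinating node merges them all,
    # just to return the final 10 documents. The cost grows with depth.
    offset = (page_number - 1) * PAGE_SIZE
    return es.search(
        index="documents",  # placeholder index name
        body={
            "query": {"match_all": {}},
            "from": offset,
            "size": PAGE_SIZE,
        },
    )

# Harmless at page 3; very expensive at the depths Googlebot was reaching.
# (Recent Elasticsearch versions reject from + size beyond the
#  index.max_result_window setting, 10,000 hits by default.)
fetch_page(3)
```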
The solution was to not let the user page through more than a few hundred or a few thousand pages (I believe we kept it to a maximum of 3000 pages). The problem went away immediately.
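In case it helps anyone, this is roughly what the cap looks like on the application side (the names and structure are illustrative, not our actual code):

```python
# Illustrative sketch: refuse to page deeper than a fixed limit.
MAX_PAGE = 3000   # the cap mentioned above
PAGE_SIZE = 10

def clamp_page(requested_page):
    """Clamp any requested page into the range 1..MAX_PAGE."""
    return min(max(int(requested_page), 1), MAX_PAGE)

def fetch_page_capped(es, index, query, requested_page):
    # Crawlers asking for page 100,000 silently get page MAX_PAGE instead,
    # so no request ever forces the cluster to rank hits beyond that depth.
    page = clamp_page(requested_page)
    return es.search(
        index=index,
        body={
            "query": query,
            "from": (page - 1) * PAGE_SIZE,
            "size": PAGE_SIZE,
        },
    )
```

Newer Elasticsearch versions also put a server-side ceiling on from + size via the index.max_result_window setting (10,000 hits by default); for anything that genuinely needs to walk deep into a result set, the scroll API is the intended tool.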
If anyone could offer additional insight as to why this occurs, I would appreciate it; I did not see anything in the documentation warning about it. Also, a big shout-out to the ES engineer who found the issue; I was impressed.
Congratulations! Just an update from our side: we are still facing the issue. We upgraded the OS to kernel 4.4.0-45-generic, but without success, so the link I posted previously did not help.
Our issue might be related to our heavy use of aggregations, so I wonder whether there is a way to force garbage collection once in a while.
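Before forcing GC, it might be worth confirming whether the heap actually stays full between collections. A quick sketch using the nodes stats API (the cluster address is a placeholder):

```python
# Illustrative sketch: poll per-node JVM heap usage via the nodes stats API
# to see whether heap pressure really builds up during heavy aggregations.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print(
        node.get("name", node_id),
        f"heap used: {mem['heap_used_percent']}%",
        f"({mem['heap_used_in_bytes']} / {mem['heap_max_in_bytes']} bytes)",
    )
```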
Thanks for the advice. However, that wasn't the issue in our case. We finally figured it out: we had some bad Groovy query scripts that caused memory leaks. Refer to this GitHub discussion.