Hi, we are experiencing some issues with our Elasticsearch cluster: after around 24 hours of usage the nodes report more than 95% heap usage and start becoming unresponsive, bringing down the whole cluster; we have to restart them constantly to keep it alive. This is the command line we use to start the Elasticsearch process:
/usr/bin/java -Xms25g -Xmx25g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.4.1.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -p /var/run/elasticsearch/elasticsearch.pid -d -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.conf=/etc/elasticsearch
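For reference, this is roughly how we watch the heap climb on each node (the address is a placeholder for one of our data nodes; the cat-nodes columns are standard in 2.4):

```shell
# Show per-node heap usage; heap.percent creeps above 95 after ~24h of traffic.
# Replace localhost:9200 with one of your node addresses.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
```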
Another interesting detail: in our CQRS architecture, if we attach a cluster only to the write path the heap issue doesn't occur, but heap usage starts climbing as soon as we attach the nodes to the read path of the system.
This seems to point to a caching issue or to some query triggering the behaviour. On the latter, we do have a query in which we provide up to 5000 ids to exclude from the results, similar to a
WHERE ... NOT IN (...) SQL query. Is there anything we should do to take care of this specific query?
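For context, the exclusion query has essentially this shape (index and field names here are placeholders, not our real ones; the actual request passes up to 5000 values in the terms array):

```json
POST /my_index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "terms": { "id": ["id-1", "id-2", "id-3"] }
      }
    }
  }
}
```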
Any advice on what could be causing this issue, or on how to keep heap usage from blowing up the cluster, would be really appreciated.