Hello. We have an Elasticsearch cluster where the document count keeps growing over time. Heap usage seems to grow with the document count, and at some point the cluster falls over.
We were wondering: if we split our indices into "active" and "archived", where 99% of our queries hit the active indices and the active indices are bounded in document count, would that help with the heap usage?
For example, the active indices would hold 10M documents at all times while the archived indices keep growing over time, but 99% of our queries would only hit the active indices. Would that curb the heap usage?
There is an overhead associated with the number of shards on a node, regardless of whether they are "active" or "archived," to use your parlance. The only exception is indices in the closed state, but those are not queryable. In my experience, this overhead has had a much bigger effect on heap than the number of documents in an index, or across all indices on a node.
Outside that per-shard overhead, heap usage will depend on the kinds of queries and aggregations you are running and how much data they touch, and it could indeed be this that is affecting you.
I'm much more inclined to believe that the former is the culprit, rather than the latter, though. To give a ballpark, with a 30 GB heap the sweet spot is around 600 shards per node. It can fluctuate a bit higher or lower than that, depending on the queries and usage patterns, but that's a good ballpark figure. So what happens when you exceed that number? Memory pressure mounts rapidly. Perhaps the very kind you're starting to see.
How many shards per node do you have? And what is your heap size?
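You can pull both numbers quickly from the _cat APIs; adjust the host and credentials for your cluster:

```bash
# Shards per node (the "shards" column), plus disk usage:
curl -s "localhost:9200/_cat/allocation?v"

# Configured and current heap on each node:
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent"
```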
If you have indices that are no longer being written to, you can potentially reduce heap usage by reducing the number of segments per shard, as outlined in this blog post. Note that force merging is an I/O-intensive operation, so it is probably best performed off-peak if possible.
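For reference, a force merge down to a single segment per shard looks roughly like this (the index name is just a placeholder):

```bash
# I/O-heavy: merge each shard of an index that is no longer written to
# down to a single segment. Best run during off-peak hours.
curl -s -X POST "localhost:9200/archive-2017.01/_forcemerge?max_num_segments=1"
```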
If you are on a recent version of Elasticsearch and your shards are not very large, you might also be able to reduce the shard count through the shrink index API.
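A rough sketch of the shrink flow, with placeholder index and node names:

```bash
# 1. Put a copy of every shard on one node and block writes.
curl -s -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}'

# 2. Shrink into a new index with fewer primaries (the new shard count
#    must be a factor of the old one).
curl -s -X POST "localhost:9200/my-index/_shrink/my-index-shrunk" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'
```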
Indeed, this exceeds the recommended maximum: your heap is 16 GB, yet you have 487 shards per node. That could definitely explain the memory pressure you're seeing.
I agree with @Christian_Dahlqvist, here. Forcemerge and Shrink could help alleviate some of your memory pressure.
As a temporary solution, do you see any issue with raising our heap from 16 GB to 22-24 GB (considering the nodes have 31 GB of total memory)?
From reading the linked articles, we're considering reducing our shards per index to 2, and then doing index rollover for indices that grow large (most of our indices don't; they stay small) while using aliases to read from and write to the indices that have been rolled over.
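For the shard count, we'd put it in a template along these lines (pattern and values are only illustrative):

```bash
# Legacy _template syntax (index_patterns needs 6.x; 7.8+ clusters would
# use the composable _index_template API instead).
curl -s -X PUT "localhost:9200/_template/app-logs" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["app-logs-*"],
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}'
```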
That's less than ideal, but I think that increasing memory will help some.
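If you do raise it, keep -Xms and -Xmx identical and stay below the compressed-oops cutoff (roughly 26-30 GB). For example, something like this, with the path depending on how Elasticsearch is installed:

```bash
# Illustrative only: set min and max heap to 22 GB in jvm.options,
# then restart the node (do this one node at a time).
sudo sed -i -e 's/^-Xms.*/-Xms22g/' -e 's/^-Xmx.*/-Xmx22g/' /etc/elasticsearch/jvm.options
sudo systemctl restart elasticsearch
```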
I also agree with using rollover indices and targeting a size per index, rather than daily new indices. That way, even the ones that stay small can benefit (unless they're not constantly getting a new stream of data, in which case rollover isn't a good fit).
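With rollover, the size/age/doc-count targets go in the request conditions, along these lines (alias name and thresholds are only examples; the max_size condition needs 6.1 or later):

```bash
# Roll the write alias over to a fresh index once the current one is
# big enough, full enough, or old enough.
curl -s -X POST "localhost:9200/app-logs-write/_rollover" -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_size": "30gb",
    "max_docs": 10000000,
    "max_age": "30d"
  }
}'
```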
Are you generally updating and/or deleting data in your indices? If that is the case, using rollover might complicate things, as you would only be able to update/delete documents in the current index through the write alias.
Upping the memory completely resolved our issues, and our cluster health is the best it's been in about two years! CPU usage went from very erratic to stable and low, and the same goes for old GC time. Heap usage on each node creeps up to 75%, but then GC brings it back down substantially (this was not happening before).
We still want to do index rollover as a longer-term solution. We don't have the perfect time-series use case, though, since we will be updating old indices; but the things we're indexing have a monotonically increasing id, so we can use that to determine which index to write to.
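Roughly, the idea (index and alias names made up) is to move the write alias ourselves whenever the id range rolls into a new index, keeping exactly one index as the write target:

```bash
# Requires is_write_index support (Elasticsearch 6.4+).
curl -s -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "items-000001", "alias": "items-write", "is_write_index": false } },
    { "add": { "index": "items-000002", "alias": "items-write", "is_write_index": true } }
  ]
}'
```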