Hey! So I am running Elasticsearch on somewhat memory-constrained cloud boxes, and after I index a few hundred gigs of data the segments memory begins to approach the heap I've allocated to the process, and soon it starts crashing with OOMs and becomes totally unusable. So what are the factors that determine the amount of segments memory an index needs? Is there a formula I can use to approximate segments memory usage as a function of number of documents, document size, and maybe other factors? Thanks!
Hey Dave, take a look at these links.
FWIW, I have used the cat API, like this, to check segment memory (the URL is quoted so the shell doesn't try to expand the `?`):
curl -s -XGET "http://localhost:9200/_cat/indices?h=index,dc,fm,fcm,sm"
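To turn that cat output into a per-document figure, here is a rough Python sketch (not from the thread). It assumes the request was made with `h=index,dc,sm&bytes=b`, so each row is just the index name, the document count, and segment memory as a plain byte count; the sample rows below are made up for illustration:

```python
# Parse `_cat/indices?h=index,dc,sm&bytes=b` output and report
# segment-memory bytes per document for each index.

def seg_bytes_per_doc(cat_output: str) -> dict:
    """Map index name -> segment memory bytes per document."""
    result = {}
    for line in cat_output.strip().splitlines():
        index, dc, sm = line.split()
        docs, seg_bytes = int(dc), int(sm)
        if docs:  # skip empty indices to avoid dividing by zero
            result[index] = seg_bytes / docs
    return result

# Hypothetical sample output: index, docs.count, segments.memory (bytes)
sample = """
logs-2016.01 1000000 6000000
logs-2016.02 2000000 9000000
"""
print(seg_bytes_per_doc(sample))
# -> {'logs-2016.01': 6.0, 'logs-2016.02': 4.5}
```

Running this against a sampling of your own indices gives you the bytes-per-doc ratio for your data, which is how the estimate below was derived.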
We run multiple clusters in production with a use case similar to ELK, though not identical. Across a sampling of clusters and indices we divided segment memory by the number of documents in each index, and our observation was that every document in the store requires roughly 4-8 bytes of segment memory. We use doc values. There may be better ways to estimate segment memory requirements, but this is our approach.
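That rule of thumb lends itself to a back-of-envelope capacity estimate. A minimal sketch, assuming the 4-8 bytes/doc figure above holds for your data; the heap size and the fraction of heap you're willing to spend on segment memory are made-up inputs, not recommendations:

```python
# Estimate how many documents fit before segment memory exceeds a
# chosen share of the heap, using an assumed bytes-per-doc ratio.

def max_docs(heap_bytes: int, segment_budget_fraction: float = 0.5,
             bytes_per_doc: float = 8.0) -> int:
    """Document count at which segment memory hits its budget."""
    return int(heap_bytes * segment_budget_fraction / bytes_per_doc)

heap = 4 * 1024**3  # assumed 4 GB heap

print(max_docs(heap))                      # pessimistic, 8 bytes/doc -> 268435456
print(max_docs(heap, bytes_per_doc=4.0))   # optimistic, 4 bytes/doc -> 536870912
```

So on a hypothetical 4 GB heap with half reserved for segment memory, the rule of thumb suggests somewhere between ~270M and ~540M documents, with the caveat in the next posts that this ratio is empirical, not something you can derive up front.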
Right, those APIs tell me how much segments memory my Elasticsearch is currently using, but I want to estimate, before I even start the process, how many points I can index before memory starts getting tight. Can I calculate this beforehand?
For posterity, if anybody else ever wants to do this: it's probably impossible. Lucene's in-memory terms index is this goofy thing, a finite state transducer (http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html), and there's no way to predict the size of that data structure from the number of documents added to it.