I've heard about the high memory (heap) occupancy of the FST used by the completion suggester.
Is there any rule of thumb I can use to estimate the FST / heap size given the amount of data that will be fed as input to the suggester? Any benchmarking studies focusing on this are also welcome.
This will help us decide between the completion suggester and other, less memory-intensive approaches.
We're trying to evaluate the completion suggester for our use case. The fields to autocomplete are short, consisting of at most 4 words.
We're tokenizing them ourselves at indexing time and indexing the resulting array of tokens as input, so as to support search on any token (not just the prefix).
We're enabling completion on just one field, which can consist of at most 4 words, and we also limit max_input_length to 50 characters.
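For context, a minimal sketch of this setup, assuming the 7.x Python client and hypothetical names (`items` index, `title_suggest` field); the tokenization step is simplified to a plain whitespace split:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Completion field capped at 50 characters per input, as described above.
es.indices.create(
    index="items",
    body={
        "mappings": {
            "properties": {
                "title_suggest": {
                    "type": "completion",
                    "max_input_length": 50,
                }
            }
        }
    },
)

# Index every token of the (max 4-word) title as a separate input,
# so a suggest query can match any token, not just the leading prefix.
title = "stainless steel water bottle"
es.index(
    index="items",
    body={"title_suggest": {"input": title.split()}},
)
```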
I checked the node stats as well as the index stats for the "completion" field.
The node stats clearly state that "size_in_bytes" is memory usage (not disk).
However, for the index stats, I'm not sure whether "size_in_bytes" means RAM or disk.
It was 96 MB for a single shard (192 MB for an index with 2 shards).
We have 10 indices on which I indexed a completion field. These 10 are just clones of each other, and I only use 1 of them for my queries.
However, each of them occupies the same amount of memory. Does this mean there is no optimization based on frequency of usage?
Are the FSTs of all indices always kept in memory?
I also see that this number cleanly doubles with 2 shards per index.
I'd assume that if it is memory, it's probably all on the heap, although the heap stats don't have a breakdown of completion-related fields.
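For reference, this is roughly how I'm pulling those numbers (a sketch, assuming the 7.x Python client and the same hypothetical `items` index):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index-level stats: the "completion" block only exposes size_in_bytes.
idx = es.indices.stats(index="items", metric="completion")
per_index = idx["indices"]["items"]["total"]["completion"]["size_in_bytes"]
print("index completion size_in_bytes:", per_index)  # ~192 MB across 2 shards here

# Node-level stats: the same metric, aggregated over all shards on each node.
nodes = es.nodes.stats(metric="indices", index_metric="completion")
for node_id, node in nodes["nodes"].items():
    print(node_id, node["indices"]["completion"]["size_in_bytes"])
```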
Indeed, FSTs are kept in memory on the heap (loaded from disk). There is no optimization based on usage; this data structure exists as is, independent of your usage.
Thanks for the clarification @spinscale
So I read through all the metrics again, and my understanding is that metrics with just "size_in_bytes" are actually disk-size metrics, while metrics with "memory" in the name, like "memory_in_bytes" and "terms_memory_in_bytes", are RAM metrics.
Going by that logic, since the "completion" field has just "size_in_bytes", it should be a disk metric, right? (Or am I wrong?)
If it is a disk-size metric, I did not find any metric with "memory" in the name that corresponds to just "completion". I can see "terms_memory_in_bytes", "stored_fields_memory_in_bytes", "term_vectors_memory_in_bytes", and "norms_memory_in_bytes", but they don't seem to be specific to completion. Could you point me to the memory metric for that?
Digging through the code: the completion stats load the FST data structure on the heap and then use that for the size calculation, so the "size_in_bytes" reported there is indeed a memory-based figure.
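As a rough sanity check (a sketch, again assuming the 7.x Python client), you can sum the completion "size_in_bytes" across all indices and set it against the JVM heap reported per node; the FST total is part of that heap usage, not in addition to it:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Sum the completion FST size over all indices in the cluster.
stats = es.indices.stats(metric="completion")
total_fst = sum(
    s["total"]["completion"]["size_in_bytes"]
    for s in stats["indices"].values()
)

# Compare against the JVM heap currently used on each node.
nodes = es.nodes.stats(metric="jvm")
for node_id, node in nodes["nodes"].items():
    heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
    print(node_id, "heap_used:", heap_used, "completion FSTs (cluster-wide):", total_fst)
```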