Dear developers,
please help me understand what FSTs (finite-state transducers) are and what they are actually used for.
Context:
My Elasticsearch test instances are under a lot of heap pressure (75%+ even after a full GC) from objects that cannot be garbage collected. When I open a heap dump, many of these objects turn out to be part of OnHeapFSTStore objects.
I tried to google them but I still can't answer these questions:
What are FSTs used for? Are FSTs only used for the completion suggester? I only find them mentioned in that context.
If yes, can that suggester be disabled in any way if it is not needed?
Are there APIs that allow me to extract more information from the nodes regarding the FSTs or heap usage generally?
Please give me some pointers where I should start reading.
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-segments.html : the size.memory column is what comes to mind off the top of my head.
Note that FSTs are used for the term dictionary, and sometimes really expensive analysis such as ngrams and shingles can produce a really huge number of terms. By the way, you could examine the heap dump more deeply to see which data structures hold these FSTs.
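A minimal sketch of reading the size.memory column from the cat segments API, using only the Python standard library. The node address and the helper names (fetch_segments, memory_by_index) are assumptions for illustration, and the sketch assumes the JSON output returns values as strings. size.memory is only meaningful on versions that still report per-segment heap usage.

```python
import json
import urllib.request
from collections import defaultdict

# Assumed address of a test node; adjust to your cluster.
ES_URL = "http://localhost:9200"

def fetch_segments(base_url=ES_URL):
    """Fetch one row per segment from the cat segments API as JSON."""
    url = (base_url + "/_cat/segments?format=json&bytes=b"
           "&h=index,shard,segment,size,size.memory,compound")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def memory_by_index(rows):
    """Sum size.memory (in bytes) per index, split by the compound flag."""
    totals = defaultdict(lambda: {"true": 0, "false": 0})
    for row in rows:
        # The compound flag is normalised to the strings "true" / "false".
        flag = str(row["compound"]).lower()
        totals[row["index"]][flag] += int(row["size.memory"])
    return {index: dict(split) for index, split in totals.items()}
```

Calling memory_by_index(fetch_segments()) against a live node should show which indices, and which kind of segment, account for most of the reported heap.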
Dear @Mikhail_Khludnev,
thank you for responding! When seeing this data I think I finally understand what's happening.
So if 90% of the segment memory is occupied by terms, that means the term dictionary is just really big, and it keeps all these FST structures on the heap. Thus all I need is to figure out how to decrease either the number of terms or their in-memory representation.
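A rough sketch of how the terms share could be checked per node, assuming the node stats API (GET /_nodes/stats/indices/segments) on this version still reports memory_in_bytes and terms_memory_in_bytes under indices.segments. The node address and function names are made up for illustration.

```python
import json
import urllib.request

def fetch_segment_stats(base_url="http://localhost:9200"):
    """Fetch segment statistics for every node (assumed node address)."""
    url = base_url + "/_nodes/stats/indices/segments"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def terms_share(node_stats):
    """Return {node name: fraction of segment memory used by terms}."""
    shares = {}
    for node in node_stats["nodes"].values():
        segments = node["indices"]["segments"]
        total = segments["memory_in_bytes"]
        if total:  # skip nodes that report no segment memory
            shares[node["name"]] = segments["terms_memory_in_bytes"] / total
    return shares
```

A share close to 1.0 on a node would be consistent with the term dictionary dominating the segment memory there.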
I saw that segments with compound==false are a whole lot more expensive in size.memory than the other segments that Lucene has already merged. From other clusters I know that a force-merge can vastly improve the memory footprint. Force-merging is sadly not possible in this case. Both disk I/O and CPU would be fine with more merges. Currently ES has every SSD as a separate path.data. This may be a bad design on my part.
So I'll just try increasing index.merge.scheduler.max_thread_count beyond the current maximum of 4 (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html).
Currently I use best_compression because I wanted to trade CPU cycles for disk I/O, but I'll also try reverting that.
If there are other merge settings that I should test, please tell me!
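For reference, a sketch of how the index.merge.scheduler.max_thread_count change could be sent through the update index settings API. The node address, index name, and thread count in the usage note are placeholders, and the helper name is hypothetical.

```python
import json
import urllib.request

def build_merge_settings_request(base_url, index, max_threads):
    """Build a PUT <index>/_settings request that sets the merge thread count."""
    body = {"index": {"merge": {"scheduler": {"max_thread_count": max_threads}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

The request would then be sent with urllib.request.urlopen(build_merge_settings_request("http://localhost:9200", "my-index", 6)); whether a higher value actually helps depends on the hardware.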