Is there an easy way (even if not entirely accurate) to estimate the size of an individual filter in the filter cache if we know the approximate number of documents the index holds? I realize it's a bit tricky, since the filter cache is node-level, not index-level, by default.
If it were a plain non-sparse bitset, a filter would be n bits in size: 1B documents = 1B bits = 125MB. But I'm guessing ES uses a cleverer bitset implementation than that.
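For the dense case the arithmetic really is just one bit per document, rounded up to whole bytes. A minimal sketch of that back-of-envelope math (the doc count is just an example number):

```java
public class DenseFilterEstimate {
    public static void main(String[] args) {
        long numDocs = 1_000_000_000L;   // example: ~1B docs visible to the node
        long bytes = (numDocs + 7) / 8;  // one bit per doc, rounded up to bytes
        System.out.printf("%,d docs -> ~%.0f MB per dense cached filter%n",
                numDocs, bytes / 1e6);   // prints ~125 MB for 1B docs
    }
}
```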
I guess that's an easy way to estimate things, but in our prod environment we can't guarantee that only one query is executing against a node unless we pull the node out of the cluster (which wouldn't be a great idea).
Doing some research, it seems ES uses a SparseFixedBitSet under the hood for caching filters. From the Lucene docs:
A bit set that only stores longs that have at least one bit which is set. The way it works is that the space of bits is divided into blocks of 4096 bits, which is 64 longs. Then for each block, we have:
a long[] which stores the non-zero longs for that block
a long so that bit i being set means that the i-th long of the block is non-null, and its offset in the array of longs is the number of one bits on the right of the i-th bit.
That's a bit tricky to parse, and I'm not exactly sure how to translate it into an estimate for our case.
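If you can put lucene-core on a classpath somewhere, one way to get a feel for it (a sketch for experimentation, not how ES itself reports cache sizes) is to build a SparseFixedBitSet at a segment-sized maxDoc, set roughly the fraction of bits a typical filter matches, and let Lucene report its own footprint via ramBytesUsed(). The maxDoc and density values below are made-up knobs to play with:

```java
import java.util.Random;
import org.apache.lucene.util.SparseFixedBitSet;

public class SparseFilterEstimate {
    public static void main(String[] args) {
        int maxDoc = 10_000_000;   // docs in one segment; scale to your numbers
        double density = 0.01;     // assumed fraction of docs the filter matches
        SparseFixedBitSet bits = new SparseFixedBitSet(maxDoc);
        Random rnd = new Random(42);
        for (int i = 0; i < maxDoc * density; i++) {
            bits.set(rnd.nextInt(maxDoc));  // mark a random doc as matching
        }
        // SparseFixedBitSet implements Accountable, so it can estimate
        // its own heap usage, block index and overhead included.
        System.out.printf("maxDoc=%,d density=%.2f -> ~%,d bytes%n",
                maxDoc, density, bits.ramBytesUsed());
    }
}
```

If I'm reading the javadoc structure right, the rough intuition is: even a nearly empty bitset pays for the per-block index (one entry per 4096 docs, so on the order of a few MB at 1B docs), and each additional non-zero long costs about another 8 bytes, so very sparse filters scale with the number of matching docs rather than with maxDoc.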