I imagine it'd help reduce IO from (non-multi-term) queries that frequently
don't match, for example on a field that is very specific and useful for
searching but very rarely matches anything.
It looks like the cost is in the range of 10 bits of heap per term per
segment for a false positive probability of around 1%. That means it'd be
pretty high if the index had lots of terms, especially if they were spread
across many segments. If the values were mostly unique, though, it'd be
about 10 bits per value.
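For a rough sense of scale, here is a back-of-the-envelope sketch using the
textbook sizing formula for an optimally configured Bloom filter, bits per
element = -ln(p) / (ln 2)^2. This is not necessarily how the Lucene codec
sizes its filters, and the function names and the example index shape (10
million terms in each of 20 segments) are made up for illustration.

```python
import math


def bloom_bits_per_term(false_positive_rate: float) -> float:
    """Bits per element for an optimally sized Bloom filter: -ln(p) / (ln 2)^2."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_heap_bytes(terms_per_segment: int, segments: int,
                     false_positive_rate: float = 0.01) -> int:
    """Estimated heap for one filter per segment (hypothetical index shape)."""
    total_bits = (bloom_bits_per_term(false_positive_rate)
                  * terms_per_segment * segments)
    return math.ceil(total_bits / 8)


if __name__ == "__main__":
    # About 9.6 bits per term at p = 1%, in line with the "10 bits" estimate.
    print(f"{bloom_bits_per_term(0.01):.2f} bits per term")
    # Hypothetical index: 10M terms per segment across 20 segments.
    print(f"{bloom_heap_bytes(10_000_000, 20) / 1e6:.0f} MB of heap")
```

With those assumed numbers the estimate lands around 240 MB of heap for a
single field, which is where the "pretty high" concern comes from.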
The trade-off is indeed not easy. First, the default terms dictionary can
already save some disk seeks: it keeps the prefixes of the terms that are
in the terms dictionary in an in-memory FST, so it can avoid going to disk
when the term you are looking up cannot match this FST. A bloom filter
might save a few additional disk seeks but, as you said, it's pretty
memory-intensive, and sometimes that memory would just be better spent on
the filesystem cache.
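To make the "save a few disk seeks" idea concrete, here is a toy sketch of
an in-memory Bloom filter sitting in front of a per-segment term lookup. It
is not Lucene's implementation: the BloomFilter and Segment classes, the
SHA-256 based double hashing, and the 10-bits / 7-hashes sizing are all
assumptions chosen for illustration, and a Python set stands in for the
on-disk terms dictionary.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term: str):
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(term))


class Segment:
    """Hypothetical segment: a set stands in for the on-disk terms dictionary."""

    def __init__(self, terms):
        self._on_disk = set(terms)
        self.disk_lookups = 0
        # Roughly 10 bits per term and 7 hashes, i.e. about 1% false positives.
        self.bloom = BloomFilter(10 * max(len(self._on_disk), 1), 7)
        for term in self._on_disk:
            self.bloom.add(term)

    def contains(self, term: str) -> bool:
        if not self.bloom.might_contain(term):
            return False  # answered from memory, no simulated disk access
        self.disk_lookups += 1  # simulated seek into the terms dictionary
        return term in self._on_disk


if __name__ == "__main__":
    segment = Segment(f"id-{i}" for i in range(1000))
    hits = sum(segment.contains(f"missing-{i}") for i in range(1000))
    print(hits, "matches;", segment.disk_lookups, "simulated disk lookups")
```

In this sketch almost all lookups for absent terms are answered from the
in-memory filter, at the price of the heap discussed above.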
On Thu, Jul 17, 2014 at 4:25 PM, Nikolas Everett nik9000@gmail.com wrote:
Has anyone had success adding a bloom filter to the codec for any of their
fields?
Thanks for replying. I've been looking to reduce my IO. Pushing
everything into an all field is really going to be the biggest thing, I
think, but I was wondering about the bloom filters. It doesn't sound worth
it. It feels like everything but the default codec is pretty unlikely to
be useful?
On Thu, Jul 17, 2014 at 10:37 PM, Nikolas Everett nik9000@gmail.com wrote:
It feels like everything but the default codec is pretty unlikely to
be useful?
Indeed, the default codec tries to make sensible trade-offs and would be
the most useful in most cases.