Bloom filter codec?

nik9000 · July 17, 2014, 2:25pm

Has anyone had success adding a bloom filter to the codec for any of their
fields?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

I imagine it'd help reduce IO from (non multi-term) queries that frequently
don't match, for example on a field that is very specific and useful for
searching but very rarely matches anything.

It looks like the cost is in the range of 10 bits of heap per term per
segment for a false positive probability around 1%. That would be pretty
high if the index had lots of terms, especially if they were spread across
many segments. If the values were mostly unique, though, it'd be about
10 bits per value.
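
For reference, the 10-bit figure lines up with the textbook sizing formula for an optimally configured bloom filter, bits per element = -ln(p) / (ln 2)^2. Below is a rough sketch of that arithmetic. The function names and the segment sizes are made up for illustration, and the actual codec may size its bitsets differently (for instance by rounding up), so treat the result as a lower-bound estimate rather than what Elasticsearch would really allocate.

```python
import math

def bloom_bits_per_term(fpp: float) -> float:
    """Theoretical optimum bits per element for a target false positive rate."""
    return -math.log(fpp) / (math.log(2) ** 2)

def bloom_heap_bytes(terms_per_segment, fpp: float) -> float:
    """Estimated heap if every segment keeps its own filter for the field.

    terms_per_segment: unique term count of the field in each segment
    (hypothetical numbers; a term present in k segments is paid for k times).
    """
    per_term = bloom_bits_per_term(fpp)
    total_bits = sum(math.ceil(n * per_term) for n in terms_per_segment)
    return total_bits / 8

if __name__ == "__main__":
    print(f"bits per term at 1% fpp: {bloom_bits_per_term(0.01):.2f}")
    # Hypothetical shard: 20 segments with 5 million unique terms each.
    heap = bloom_heap_bytes([5_000_000] * 20, 0.01)
    print(f"estimated heap: {heap / 1e6:.0f} MB")
```

With those made-up numbers the estimate is on the order of 120 MB of heap for a single field, which is the kind of cost I'm worried about.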

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd3X11bwogWi9oFTYFzzO6%2BdnvsOqcEFWG_dB5c%2Boy%3D4Fw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

jpountz · July 17, 2014, 8:31pm

Hi Nik,

The trade-off is not easy indeed. First, the default terms dictionary can
already save some disk seeks. By storing the prefixes of the terms that are
in the terms dictionary in an FST in memory, it can avoid going to disk when
the term that you are looking up cannot match this FST. A bloom filter
might save a few additional disk seeks but, as you said, it's pretty
memory-intensive, and sometimes that memory would just be better spent on
the filesystem cache.
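
As a toy illustration of the prefix check described above: the sketch below uses a plain Python set of fixed-length prefixes as a stand-in for the in-memory FST and counts how many lookups reach "disk". The class names, the fixed prefix length, and the seek counter are all assumptions made for the example; the real terms dictionary works differently in detail.

```python
class PrefixIndex:
    """Toy stand-in for the in-memory prefix FST: a set of term prefixes."""

    def __init__(self, terms, prefix_len=3):
        self.prefix_len = prefix_len
        self.prefixes = {t[:prefix_len] for t in terms}

    def might_contain(self, term):
        # False means the term is definitely absent; True means "check disk".
        return term[:self.prefix_len] in self.prefixes


class OnDiskTerms:
    """Pretend on-disk terms dictionary that counts how often it is hit."""

    def __init__(self, terms):
        self._terms = set(terms)
        self.seeks = 0

    def lookup(self, term):
        self.seeks += 1
        return term in self._terms


def term_exists(term, index, disk):
    # Only pay for a disk seek when the in-memory prefix check cannot rule
    # the term out.
    if not index.might_contain(term):
        return False
    return disk.lookup(term)


if __name__ == "__main__":
    terms = ["apple", "apricot", "banana"]
    index, disk = PrefixIndex(terms), OnDiskTerms(terms)
    for query in ("zebra", "apply", "banana"):
        print(query, term_exists(query, index, disk), "seeks so far:", disk.seeks)
```

In this sketch a miss like "zebra" never touches disk, while a miss like "apply" shares a prefix with a real term and still costs a seek. That second kind of miss is the only place a bloom filter could add savings.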

--
Adrien Grand

nik9000 · July 17, 2014, 8:37pm

Thanks for replying. I've been looking to reduce my IO. Pushing
everything into an all field is really going to be the biggest thing, I
think, but I was wondering about the bloom filters. It doesn't sound worth
it. It feels like everything but the default codec is pretty unlikely to
be useful?

jpountz · July 17, 2014, 10:49pm

Indeed, the default codec tries to make sensible trade-offs and would be
the most useful in most cases.

--
Adrien Grand

Arvind_Kumar_Chigura · May 13, 2016, 3:17pm

Are bloom filters still supported in 2.3?

jpountz · May 13, 2016, 3:28pm

No, they are not.
