Bloom filter codec?

Has anyone had success adding a bloom filter to the codec for any of their
fields?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-codec.html#bloom-postings

I imagine it'd help reduce IO for (non-multi-term) queries that frequently
don't match, for example on a field that is very specific and useful for
searching but very rarely matches anything.

It looks like the cost is in the range of 10 bits of heap per term per
segment for a false positive probability around 1%. That means the total
would be pretty high if the index had lots of terms, especially if they
were spread over many segments. If the values were mostly unique, though,
it'd be only about 10 bits per value.
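
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. It uses the textbook formula for an optimally sized bloom filter
(bits per element = -ln(p) / (ln 2)^2) and assumes every term in every
segment gets its own entry. The function names are made up for
illustration, and the real Lucene implementation may round its bit set
sizes differently, so treat the numbers as rough estimates only.

```python
import math
from typing import Sequence


def bloom_bits_per_element(fpp: float) -> float:
    """Textbook optimum: bits per element for a target false positive rate."""
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_heap_bytes(terms_per_segment: Sequence[int], fpp: float = 0.01) -> float:
    """Estimated heap for one field: each segment holds its own filter."""
    bits = bloom_bits_per_element(fpp)
    return sum(n * bits for n in terms_per_segment) / 8


if __name__ == "__main__":
    # Roughly 9.6 bits per term at 1% false positives.
    print(round(bloom_bits_per_element(0.01), 2))
    # Hypothetical index: 10 segments with 1 million terms each.
    print(round(bloom_heap_bytes([1_000_000] * 10) / 1e6, 1), "MB")
```

The per-segment sum is what makes many segments expensive: a term that
appears in ten segments is paid for ten times.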

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd3X11bwogWi9oFTYFzzO6%2BdnvsOqcEFWG_dB5c%2Boy%3D4Fw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Hi Nik,

The trade-off is not easy indeed. First, the default terms dictionary can
already save some disk seeks. By keeping the prefixes of the terms in the
terms dictionary in an in-memory FST, it can avoid going to disk when the
term that you are looking up cannot match this FST. A bloom filter might
save a few additional disk seeks but, as you said, it's pretty
memory-intensive, and sometimes that memory would just be better spent on
the filesystem cache.
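
The idea of rejecting a lookup in memory before touching disk can be
sketched as below. This is a toy model only: it uses a plain set of
fixed-length prefixes instead of a real FST, and the class name, the
prefix length and the seek counter are all hypothetical. It is not how
Lucene implements its terms index.

```python
class PrefixTermsIndex:
    """Toy stand-in for an in-memory terms index (not Lucene's FST)."""

    def __init__(self, terms, prefix_len=2):
        self.prefix_len = prefix_len
        # Small structure kept in memory: only the term prefixes.
        self.prefixes = {t[:prefix_len] for t in terms}
        # Pretend this full dictionary lives on disk.
        self._on_disk = set(terms)
        self.disk_seeks = 0

    def contains(self, term):
        # No known term shares this prefix: answer without a disk seek.
        if term[: self.prefix_len] not in self.prefixes:
            return False
        # Prefix matched, so the on-disk dictionary has to be consulted.
        self.disk_seeks += 1
        return term in self._on_disk


if __name__ == "__main__":
    idx = PrefixTermsIndex(["apple", "apricot", "banana"])
    print(idx.contains("cherry"), idx.disk_seeks)  # rejected in memory
    print(idx.contains("apex"), idx.disk_seeks)    # prefix matches, seek needed
```

The second lookup is the case where a bloom filter could still help: the
prefix matches, the term is absent, and the prefix check alone cannot
avoid the seek.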

--
Adrien Grand

Thanks for replying. I've been looking to reduce my IO. Pushing
everything into an all field is really going to be the biggest thing, I
think, but I was wondering about the bloom filters. It doesn't sound worth
it. It feels like everything but the default codec is pretty unlikely to
be useful?

Indeed, the default codec tries to make sensible trade-offs and would be
the most useful in most cases.

--
Adrien Grand

Are bloom filters still supported in 2.3?

No, they are not.