Bloom Filter and other Lucene utils

Hi all (but mostly Shay),

I came across this posting from the Greplin blog today (I'm not really familiar with Greplin, the link was passed by a friend): http://tech.blog.greplin.com/lucene-utilities-and-bloom-filters

They describe a number of features that they've developed and made available on GitHub, in particular their implementation of a Bloom filter. I've noticed that ES incorporates a "SimpleBloomCache" as per this thread:

http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/d167989c71cfd0/1fcc8bea32623e90?lnk=gst

My questions are thus:

  • would the Greplin Bloom implementation be a worthwhile (or even possible) improvement to ES?
  • would any/all of the other features be worth porting to ES?

I'm normally reluctant to suggest large enhancements/features, but in this case it seems that having code already written and openly available might make it a reasonable notion. What are your thoughts?

Cheers,
MJ

Heya,
On Thursday, April 14, 2011 at 6:49 PM, MJ Suhonos wrote:

My questions are thus:

  • would the Greplin Bloom implementation be a worthwhile (or even possible) improvement to ES?
    The Bloom filter Greplin wrote is not really tied to Lucene, though it can be used in certain places when working with Lucene. I had a quick look at the implementation; the ES one is different, inspired by Cassandra (which uses Lucene's OpenBitSet for it :), round and round the open source goes). Judging only from the code, and without having tested it, I expect the ES implementation to be faster and use less memory.
  • would any/all of the other features be worth porting to ES?
    The only other thing I see is the phrase query, which could easily be added as another query option (it simply wraps Lucene's MultiPhraseQuery).
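
For readers who have not met the data structure under discussion, here is a minimal Bloom filter sketch in Java. It is not the ES SimpleBloomCache, the Cassandra filter, or the Greplin code; the class name SimpleBloomSketch, the seeded FNV-1a hash, and the double-hashing scheme are all assumptions chosen for illustration, and java.util.BitSet stands in for Lucene's OpenBitSet so the example has no dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only: not taken from ES, Cassandra, or Greplin.
class SimpleBloomSketch {
    private final BitSet bits;      // stand-in for Lucene's OpenBitSet
    private final int numBits;      // size of the bit array
    private final int numHashes;    // bits set/checked per key

    SimpleBloomSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded 32-bit FNV-1a; any reasonably well-mixed hash would do here.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record a key by setting numHashes bits.
    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false means the key was definitely never added;
    // true means it was probably added (false positives are possible).
    boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A key that has been added always reports true from mightContain, and a freshly constructed filter reports false for every key. Memory use and false-positive rate are governed by numBits and numHashes, which is where real implementations differ most.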

Actually, the more interesting work Greplin did is with interval fields (it's another project on GitHub). That would be a cool feature to have in ES, but I need to review it more before including it.

I'm normally reluctant to suggest large enhancements/features, but in this case it seems that having code already written and openly available might make it a reasonable notion. What are your thoughts?
Don't be reluctant to suggest any type of feature, no matter how big or small; that's the first thought that jumps to mind :)

Cheers,
MJ

  • would the Greplin Bloom implementation be a worthwhile (or even possible) improvement to ES?
    ... The ES implementation will be faster and use less memory, though I have not tested it, just by looking at the code.

Fair enough; I'm not surprised of course, but wanted to mention it so you were aware of it.

  • would any/all of the other features be worth porting to ES?
    The only thing that I see is the phrase query, which can be easily added as another query option (it simply wraps Lucene MultiPhraseQuery).

That might be handy.

I'm normally reluctant to suggest large enhancements/features, but in this case it seems that having code already written and openly available might make it a reasonable notion. What are your thoughts?
Don't be reluctant to suggest any type of feature, no matter how big or small; that's the first thought that jumps to mind :)

Also noted. ;)

Thanks as always,
MJ