ES 0.17.0, Lucene 3.3 finite state


(Sebastian Gavarini) #1

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in ElasticSearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in ElasticSearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.


(Shay Banon) #2

FST for this is not part of Lucene 3.3, only in trunk (upcoming 4.0).

On Tue, Jul 19, 2011 at 3:22 AM, Sebastian Gavarini sgavarini@gmail.comwrote:

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in ElasticSearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in ElasticSearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.


(Sebastian Gavarini) #3

Shay,

I have checked Lucene's 3.3 changelist and it reports two new features, one
for the spellchecker, but also another one called FST backport:
https://issues.apache.org/jira/browse/LUCENE-3140
https://issues.apache.org/jira/browse/LUCENE-3135

Isn't the first one what's needed to make fuzzy fst? (granted that I haven't
seen any "query" class in the 3140 patch, which isn't very encouraging).

Is the backported code only good for the separate spellchecker?

Thanks,
Sebastian.

On Mon, Jul 18, 2011 at 9:30 PM, Shay Banon shay.banon@elasticsearch.comwrote:

FST for this is not part of Lucene 3.3, only in trunk (upcoming 4.0).

On Tue, Jul 19, 2011 at 3:22 AM, Sebastian Gavarini sgavarini@gmail.comwrote:

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in ElasticSearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in ElasticSearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.


#4

On Mon, Jul 18, 2011 at 9:01 PM, Sebastian Gavarini sgavarini@gmail.com wrote:

Shay,
I have checked Lucene's 3.3 changelist and it reports two new features, one
for the spellchecker, but also another one called FST backport:
https://issues.apache.org/jira/browse/LUCENE-3140
https://issues.apache.org/jira/browse/LUCENE-3135
Isn't the first one what's needed to make fuzzy fst? (granted that I haven't
seen any "query" class in the 3140 patch, which isn't very encouraging).
Is the backported code only good for the separate spellchecker?

Hi, there are two different pieces of finite-state functionality in lucene:

  • FSA (finite state automaton). think of this as HashSet: this is
    totally in 4.0-only
    This is used mostly for things that go directly against the index:
    It implements AutomatonQuery, WildcardQuery, RegexpQuery,
    FuzzyQuery, and DirectSpellChecker in 4.0
  • FST (finite state transducer). this of this as HashMap: this is in
    3.x, and 4.0
    This is used as a general datastructure, to hold things (e.g. terms
    -> something).
    It implements the terms index and memory codec in 4.0-only
    But, it implements auto-suggest and soon also, synonyms in 3.x and 4.0

So, the fst backport here only applies to suggest, which fills a
separate FST from the lucene index.
The spellchecker stuff in 4.0-only is different, it turns the users
query into an FSA and runs it against the lucene index directly.

for more information, see Dawid Weiss' presentation here:
http://www.lucidimagination.com/sites/default/files/Weiss%20Dawid%20-%20Finite%20State%20Automata%20in%20Lucene.pdf

--
lucidimagination.com


(Sebastian Gavarini) #5

Thanks for the explanation Robert, I'll wait for 4.0 then.

On Mon, Jul 18, 2011 at 10:33 PM, Robert Muir rcmuir@gmail.com wrote:

On Mon, Jul 18, 2011 at 9:01 PM, Sebastian Gavarini sgavarini@gmail.com
wrote:

Shay,
I have checked Lucene's 3.3 changelist and it reports two new features,
one
for the spellchecker, but also another one called FST backport:
https://issues.apache.org/jira/browse/LUCENE-3140
https://issues.apache.org/jira/browse/LUCENE-3135
Isn't the first one what's needed to make fuzzy fst? (granted that I
haven't
seen any "query" class in the 3140 patch, which isn't very
encouraging).
Is the backported code only good for the separate spellchecker?

Hi, there are two different pieces of finite-state functionality in lucene:

  • FSA (finite state automaton). think of this as HashSet: this is
    totally in 4.0-only
    This is used mostly for things that go directly against the index:
    It implements AutomatonQuery, WildcardQuery, RegexpQuery,
    FuzzyQuery, and DirectSpellChecker in 4.0
  • FST (finite state transducer). this of this as HashMap: this is in
    3.x, and 4.0
    This is used as a general datastructure, to hold things (e.g. terms
    -> something).
    It implements the terms index and memory codec in 4.0-only
    But, it implements auto-suggest and soon also, synonyms in 3.x and 4.0

So, the fst backport here only applies to suggest, which fills a
separate FST from the lucene index.
The spellchecker stuff in 4.0-only is different, it turns the users
query into an FSA and runs it against the lucene index directly.

for more information, see Dawid Weiss' presentation here:

http://www.lucidimagination.com/sites/default/files/Weiss%20Dawid%20-%20Finite%20State%20Automata%20in%20Lucene.pdf

--
lucidimagination.com


(system) #6