ES 0.17.0, Lucene 3.3 finite state

Sebastian_Gavarini · July 19, 2011, 12:22am

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in ElasticSearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in ElasticSearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.

kimchy · July 19, 2011, 12:30am

FST for this is not part of Lucene 3.3, only in trunk (upcoming 4.0).

On Tue, Jul 19, 2011 at 3:22 AM, Sebastian Gavarini sgavarini@gmail.comwrote:

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in Elasticsearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in Elasticsearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.

Sebastian_Gavarini · July 19, 2011, 1:01am

Shay,

I have checked Lucene's 3.3 changelist and it reports two new features, one
for the spellchecker, but also another one called FST backport:
https://issues.apache.org/jira/browse/LUCENE-3140
https://issues.apache.org/jira/browse/LUCENE-3135

Isn't the first one what's needed to make fuzzy fst? (granted that I haven't
seen any "query" class in the 3140 patch, which isn't very encouraging).

Is the backported code only good for the separate spellchecker?

Thanks,
Sebastian.

On Mon, Jul 18, 2011 at 9:30 PM, Shay Banon shay.banon@elasticsearch.comwrote:

FST for this is not part of Lucene 3.3, only in trunk (upcoming 4.0).

On Tue, Jul 19, 2011 at 3:22 AM, Sebastian Gavarini sgavarini@gmail.comwrote:

Hi all,

If I understood correctly from the docs, finite state spellchecker is now
backported to Lucene 3.3 and so in Elasticsearch 0.17.0. I remember a
discussion in the mailing list some time ago about spellchecking and the
idea of not using a different index but just the same data ones with a
finite state machine doing fuzzy matching in an efficient way, that was
going to be available in Lucene 4. At the time I was going to use the
n-grams spellchecker in an incremental new index, but I held it because this
other FST spellchecker seemed better.

What's the state of all this in Elasticsearch 0.17.0? is it possible to
access that API somehow? Is it possible (and faster) to use the new fuzzy
queries with FST?

Thanks,
Sebastian.

rmuir · July 19, 2011, 1:33am

On Mon, Jul 18, 2011 at 9:01 PM, Sebastian Gavarini sgavarini@gmail.com wrote:

Shay,
I have checked Lucene's 3.3 changelist and it reports two new features, one
for the spellchecker, but also another one called FST backport:
[LUCENE-3140] Backport FSTs to 3.x - ASF JIRA
[LUCENE-3135] backport suggest module to branch 3.x - ASF JIRA
Isn't the first one what's needed to make fuzzy fst? (granted that I haven't
seen any "query" class in the 3140 patch, which isn't very encouraging).
Is the backported code only good for the separate spellchecker?

Hi, there are two different pieces of finite-state functionality in lucene:

FSA (finite state automaton). think of this as HashSet: this is
totally in 4.0-only
This is used mostly for things that go directly against the index:
It implements AutomatonQuery, WildcardQuery, RegexpQuery,
FuzzyQuery, and DirectSpellChecker in 4.0
FST (finite state transducer). this of this as HashMap: this is in
3.x, and 4.0
This is used as a general datastructure, to hold things (e.g. terms
-> something).
It implements the terms index and memory codec in 4.0-only
But, it implements auto-suggest and soon also, synonyms in 3.x and 4.0

So, the fst backport here only applies to suggest, which fills a
separate FST from the lucene index.
The spellchecker stuff in 4.0-only is different, it turns the users
query into an FSA and runs it against the lucene index directly.

for more information, see Dawid Weiss' presentation here:
http://www.lucidimagination.com/sites/default/files/Weiss%20Dawid%20-%20Finite%20State%20Automata%20in%20Lucene.pdf

--

Sebastian_Gavarini · July 19, 2011, 2:02am

Thanks for the explanation Robert, I'll wait for 4.0 then.

On Mon, Jul 18, 2011 at 10:33 PM, Robert Muir rcmuir@gmail.com wrote:

On Mon, Jul 18, 2011 at 9:01 PM, Sebastian Gavarini sgavarini@gmail.com
wrote:

Shay,
I have checked Lucene's 3.3 changelist and it reports two new features,
one
for the spellchecker, but also another one called FST backport:
[LUCENE-3140] Backport FSTs to 3.x - ASF JIRA
[LUCENE-3135] backport suggest module to branch 3.x - ASF JIRA
Isn't the first one what's needed to make fuzzy fst? (granted that I
haven't
seen any "query" class in the 3140 patch, which isn't very
encouraging).
Is the backported code only good for the separate spellchecker?

Hi, there are two different pieces of finite-state functionality in lucene:

FSA (finite state automaton). think of this as HashSet: this is
totally in 4.0-only
This is used mostly for things that go directly against the index:
It implements AutomatonQuery, WildcardQuery, RegexpQuery,
FuzzyQuery, and DirectSpellChecker in 4.0

FST (finite state transducer). this of this as HashMap: this is in
3.x, and 4.0
This is used as a general datastructure, to hold things (e.g. terms
-> something).
It implements the terms index and memory codec in 4.0-only
But, it implements auto-suggest and soon also, synonyms in 3.x and 4.0

So, the fst backport here only applies to suggest, which fills a
separate FST from the lucene index.
The spellchecker stuff in 4.0-only is different, it turns the users
query into an FSA and runs it against the lucene index directly.

for more information, see Dawid Weiss' presentation here:

http://www.lucidimagination.com/sites/default/files/Weiss%20Dawid%20-%20Finite%20State%20Automata%20in%20Lucene.pdf

--
lucidimagination.com

Topic		Replies	Views
Terms API for Spellchecker Elasticsearch	8	984	July 6, 2017
Did you mean Elasticsearch	3	484	July 6, 2017
SpellChecker Elasticsearch	2	273	July 6, 2017
Is Elasticsearch the right tool for a spellchecker? Elasticsearch	1	414	February 14, 2020
[7.0.1] Heap pressure - What are FST's used for? Elasticsearch	3	1903	November 1, 2019

ES 0.17.0, Lucene 3.3 finite state

Related topics