Terms API for Spellchecker


(Sebastian Gavarini) #1

Hi,

I am planning to implement a port of Lucene's Spellchecker on top of
ElasticSearch. For that I need to adapt a couple of things: implement
a token filter for variable-length ngram generation (short words
get short ngrams, long words get long ones), run a search on the ngram
field, and then re-sort the results outside of ES according to the
Levenshtein distance of each term. So far so good, but first I need to
be sure the word is not already in the index, to avoid spell checking
an already correct word. For that, the now defunct Terms API was
probably the best fit, with the lowest overhead.
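
For illustration, here is a minimal sketch of the variable-length ngram step described above. The class and method names are hypothetical, and the length thresholds are assumptions loosely modeled on how Lucene's Spellchecker picks gram sizes; a real ES token filter would wrap this logic in the analysis chain, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: picks n-gram sizes from the word length. Names and thresholds are assumptions. */
final class VariableNgrams {

    // Smallest gram size for a word of the given length (assumed thresholds).
    static int minGram(int wordLength) {
        if (wordLength > 5) return 3;
        if (wordLength == 5) return 2;
        return 1;
    }

    // Largest gram size for a word of the given length (assumed thresholds).
    static int maxGram(int wordLength) {
        if (wordLength > 5) return 4;
        if (wordLength == 5) return 3;
        return 2;
    }

    // Emits every n-gram of the word for each size between minGram and maxGram.
    static List<String> ngrams(String word) {
        List<String> grams = new ArrayList<>();
        int len = word.length();
        for (int n = minGram(len); n <= maxGram(len); n++) {
            for (int i = 0; i + n <= len; i++) {
                grams.add(word.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("cat"));     // short word, short grams
        System.out.println(ngrams("lucene"));  // longer word, longer grams
    }
}
```

Short words produce 1- and 2-grams, while words longer than five characters produce 3- and 4-grams, so the ngram field stays selective for long words without starving short ones.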

The current implementation of Lucene does the following:

    public boolean exist(String word) throws IOException {
        if (reader == null) {
            reader = IndexReader.open(spellIndex);
        }
        return reader.docFreq(new Term(F_WORD, word)) > 0;
    }

I know the timing is not great; I wish I had been using ES back in June
when the removal was requested and decided. Is there a possibility of
bringing the Terms API back into the code? If not, is there an
alternative with performance similar to IndexReader.docFreq()?

I think in general a low-level Lucene API is not very useful
for day-to-day stuff, but for certain things it could open up a lot of
possibilities that are pretty hard to accomplish otherwise.

Thanks,
Sebastian.


#2

On Tue, Oct 12, 2010 at 10:33 PM, Sebastian sgavarini@gmail.com wrote:

> I know it's not a great timing, I wish I was using ES back in June
> when the removal was asked/decided, but is there a possibility to get
> back in the code the Terms API? if not, is there a similar
> alternative, performance wise, to IndexReader.docFreq()?

Hello:

At the moment in lucene when you seek to a term, it must read in the
docfreq anyway, see
http://lucene.apache.org/java/3_0_2/fileformats.html#Term Dictionary

So docfreq() and seek are essentially the same...

This exist() is actually mostly used when building the spellchecker
index, not for determining whether a word is already correct. Just because
the word is in the index doesn't mean it's correct!

But for the record I am not certain that the lucene spellchecker does
this in the most performant way if you want to update a spellchecker
index with only the new terms that have been added (that seems to be
the usage of this exist()).

For example, if you are updating the spellcheck index from the new
terms in a Lucene index, I think it would be faster to traverse both
TermEnums in parallel to find the new ones that must be added.
Otherwise you are essentially seeking to each term (think n log n)
when I think you can get order n instead.

Lucene's spellchecker doesn't do this though. There are
"seek-within-block" optimizations in Lucene (at least in Lucene trunk)
that might help with this kind of thing, but I still think a parallel
TermEnum traversal is safer.
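
The parallel traversal suggested above can be sketched as a single merge pass over two sorted term streams. This is only an illustration: plain `Iterator<String>` stands in for Lucene's TermEnum, the class and method names are hypothetical, and the sketch assumes both streams are sorted with `String.compareTo` and contain no duplicates (the actual term order in a Lucene index may differ between versions).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch only: finds terms in the source index that the spell index lacks, in one linear pass. */
final class ParallelTermDiff {

    // Both iterators must yield terms in ascending String.compareTo order without duplicates.
    static List<String> missingTerms(Iterator<String> sourceTerms, Iterator<String> spellTerms) {
        List<String> missing = new ArrayList<>();
        String spell = spellTerms.hasNext() ? spellTerms.next() : null;
        while (sourceTerms.hasNext()) {
            String source = sourceTerms.next();
            // Advance the spell-index cursor until it reaches or passes the source term.
            while (spell != null && spell.compareTo(source) < 0) {
                spell = spellTerms.hasNext() ? spellTerms.next() : null;
            }
            // The source term is new if the spell cursor is exhausted or points at a different term.
            if (spell == null || !spell.equals(source)) {
                missing.add(source);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("apple", "banana", "cherry", "date");
        List<String> spell = Arrays.asList("banana", "date", "fig");
        System.out.println(missingTerms(source.iterator(), spell.iterator()));
    }
}
```

Each cursor only moves forward, so the cost is proportional to the total number of terms in both streams rather than one seek per term.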


(Sebastian Gavarini) #3

Hi Robert,

Thanks for your reply, see my comments below:

> So docfreq() and seek are essentially the same...

OK, I didn't know the details. In my case either operation is fine; it
would work the same as the current Lucene Spellchecker. What I want to
avoid, if possible, is a complete search/facet/scoring/hit-collector
request just because there is no Terms API.

> This exist() is mostly used actually when building the spellchecker
> index, not for determining an already correct word. Just because the
> word is in the index doesn't mean it's correct!

I am planning to build the Spellchecker index as a special index
inside ES (with fields analyzed by a variant of ngram), because there
is no other option right now in ES. That's why I need access to
docFreq() to implement exist(), so as not to suggest variants of a
correct word. I think I understand what you mean by "correct word",
but please correct me if I am wrong. In my case the spellchecking is
good enough if it suggests something already contained in the index; I
don't plan to use a word list or dictionary. I want to avoid searches
with empty results, but I don't care about a perfect vocabulary.
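
As a rough sketch of the flow described here (skip suggestions when the word is already indexed, otherwise re-rank candidates by Levenshtein distance outside of ES), consider the following. Everything in it is an assumption for illustration: the `Set` stands in for a docFreq() > 0 check, the candidate list stands in for the hits of an ngram search, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Sketch only: existence gate plus Levenshtein re-ranking of candidate terms. */
final class SuggestSketch {

    // Classic two-row dynamic-programming edit distance (insert, delete, substitute all cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns no suggestions for an indexed word; otherwise candidates sorted by distance, then alphabetically.
    static List<String> suggest(String word, Set<String> indexedTerms, List<String> candidates) {
        if (indexedTerms.contains(word)) {
            return new ArrayList<>(); // stands in for exist(word) == true
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt((String c) -> levenshtein(word, c))
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }
}
```

With this shape, a query word that already occurs in the index never reaches the ranking step, which matches the goal of not suggesting variants of a word that would return results anyway.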

> For example, if you are updating the spellcheck index from the new
> terms in a lucene index, i think it would be faster to traverse both
> TermEnums in parallel to find the new ones that must be added...
> otherwise you are essentially seeking to each term (think n log n)
> when I think you can get order n instead.

It's a good idea, I'll add it to my wish list. Right now I wasn't
considering incremental updates, but a full reindex from time to time,
so O(n). For traversing TermEnums, either for incremental updates as
you suggest or to get the list of terms for a full rebuild, I would
need access to Lucene again, probably through the Terms API or something
like that. Maybe the functionality is already provided by an API in ES; if
so, please point me to the docs.

Thanks,
Sebastian.


(Shay Banon) #4

Hi Sebastian,

Agreed on the Terms API. Can you open a feature request? I can add it
back. Regarding spell checking, there is very interesting work going on in
Lucene trunk on a spell checker that works directly over the actual
index (with no need to build an additional spell check index), so I am
currently leaning towards not adding this option now, and providing it once
that work is out. I actually started to implement something similar to what
is done in Lucene trunk for spellchecking (or more correctly, played with it
a bit), so I might get it in sooner, before Lucene 4.0 is out...

-shay.banon



(Otis Gospodnetić) #5

+1 for going with the new Lucene spellchecker approach (I'm surprised
Robert didn't mention it!) once that's in the release and skipping the
inferior ngram-based approach.

https://issues.apache.org/jira/browse/LUCENE-2507

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



(Shay Banon) #6

One more thing: I think the count API will be fast enough for your case,
and even faster than the Terms API in certain cases.
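
As an illustration of using the count API as an existence check, the sketch below only builds the request URI for a phrase query on a single field. The base URL, the index name `spell`, and the field name `word` are assumptions, the class and method names are hypothetical, and the sketch does not send the request or parse the response.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch only: builds a _count request URI for one word; index and field names are assumptions. */
final class CountRequestSketch {

    static URI countUri(String baseUrl, String index, String field, String word) {
        // Quote the word as a phrase and escape characters that would break the query string syntax.
        String escaped = word.replace("\\", "\\\\").replace("\"", "\\\"");
        String query = field + ":\"" + escaped + "\"";
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/" + index + "/_count?q=" + encoded);
    }

    public static void main(String[] args) {
        // A count greater than zero in the response would play the role of docFreq() > 0.
        System.out.println(countUri("http://localhost:9200", "spell", "word", "lucene"));
    }
}
```

Whether the count matches the raw word depends on how the field is analyzed, so an exact existence check would likely need a field that is not ngram-analyzed.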



(Sebastian Gavarini) #7

That's very interesting; I wasn't aware of the new fuzzy spellchecker
in the next Lucene version. I think that's a better alternative too.
No ticket for the Terms API, then.

I am only worried about the release date. Do you know when that
spellchecker is scheduled for release? Also, when is ES going to enable
it?

Thanks,
Sebastian.



(Shay Banon) #8

The Lucene 4.0 release date is not set yet, as far as I know. Once it's out,
it depends on how long it will take to upgrade to 4.0.

-shay.banon


