[ANN] ElasticSearch Suggest Plugin (using Lucene FSTSuggester)


(Alexander Reelsen) #1

Hi,

I hacked up a little plugin, which uses the Lucene FSTSuggester for
providing suggestions - and only suggestions, no query result data at
all. Basic functionality is like this:

curl -X POST 'localhost:9200/products/product/_suggest?pretty=1' -d '{ "field": "ProductName.suggest", "term": "product 1" }'
{
  "suggestions" : [ "product 1", "product 10", "product 100", "product 1000", "product 101" ],
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

You can add a size parameter to set the number of returned
suggestions (default is 10).

You can add a similarity parameter to catch typos (not set by
default), like this:

curl -X POST 'localhost:9200/products/product/_suggest?pretty=1' -d '{ "field": "ProductName.suggest", "term": "proudct", "similarity": 0.7 }'
{
  "suggestions" : [ "product" ],
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}

That's it from a functionality point of view. It took me some time
to understand how serialization and deserialization of the requests
(and the sharded requests) work, and apart from that I am not a Lucene
expert. So, if you find something to improve, just start bugging me
(the README already has quite a long list of what I should do
next).

The project is available at https://github.com/spinscale/elasticsearch-suggest-plugin
and includes a README which should help you set up suggestions and
the correct index fields. If you come up with better solutions (you
surely will), drop me a mail. I am not yet using it in production,
which is why I put the big warnings in the README :slight_smile:

Regards, Alexander


(Shay Banon) #2

Heya,

I had a quick look at part of the implementation. The first major thing
is that searchers are leaked. When you get a searcher from an index shard,
you need to release it once you are done with it.

Second, the suggester is built on a top-level reader, which means it will
be rebuilt each time the index gets refreshed with a change. This can become
really heavy. You can enhance it to only build it periodically, but still,
for a large enough index, each change to be reflected requires a full rebuild
of the suggester. This is, btw, the main problem with the current suggester
in Lucene.

Last, you can build the service as a "shard" level service, and not a
node level service. It will simplify the management of building the suggest
data structure.

-shay.banon

On Sun, Nov 27, 2011 at 3:00 PM, Alexander Reelsen <
alexander.reelsen@googlemail.com> wrote:

> I hacked up a little plugin, which uses the Lucene FSTSuggester for
> providing suggestions - and only suggestions, no query result data at
> all. [...]


(Alexander Reelsen) #3

Hi Shay

On 29 Nov., 16:26, Shay Banon kim...@gmail.com wrote:

> I had a quick look at part of the implementation. The first major thing
> is that searchers are leaked. When you get a searcher from an index shard,
> you need to release it once you are done with it.

What do you mean by "release"? Not holding a reference to it, or
performing a special action? (dumb question, I guess :slight_smile:)

> Second, the suggester is built on a top-level reader, which means it will
> be rebuilt each time the index gets refreshed with a change. This can become
> really heavy. You can enhance it to only build it periodically, but still,
> for a large enough index, each change to be reflected requires a full rebuild
> of the suggester. This is, btw, the main problem with the current suggester
> in Lucene.

True. I should implement some kind of periodic update, configurable I
guess.
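
To make the idea concrete, here is a rough sketch of what a periodic rebuild could look like. It is not the plugin's code: PeriodicSuggesterSketch, termSource and the interval arguments are names made up for illustration, and a plain TreeSet stands in for the Lucene FST. Lookups read an immutable snapshot that is swapped on each scheduled rebuild, so the rebuild cost is paid once per interval rather than once per index refresh.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class PeriodicSuggesterSketch implements AutoCloseable {

    // Snapshot used by lookups; replaced as a whole after each rebuild.
    private volatile TreeSet<String> snapshot = new TreeSet<>();
    // Hypothetical source of indexed terms (in the plugin this would be the index reader).
    private final Supplier<List<String>> termSource;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicSuggesterSketch(Supplier<List<String>> termSource, long interval, TimeUnit unit) {
        this.termSource = termSource;
        rebuild(); // build once up front so lookups work immediately
        scheduler.scheduleWithFixedDelay(this::rebuild, interval, interval, unit);
    }

    /** Full rebuild from the term source, then swap the snapshot in one assignment. */
    void rebuild() {
        snapshot = new TreeSet<>(termSource.get());
    }

    /** Prefix lookup against whichever snapshot is current, limited to `size` results. */
    List<String> suggest(String prefix, int size) {
        List<String> result = new ArrayList<>();
        for (String term : snapshot.tailSet(prefix)) {
            if (!term.startsWith(prefix) || result.size() >= size) {
                break;
            }
            result.add(term);
        }
        return result;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        List<String> indexed = new CopyOnWriteArrayList<>(
                Arrays.asList("product 1", "product 10", "other"));
        try (PeriodicSuggesterSketch suggester =
                     new PeriodicSuggesterSketch(() -> indexed, 30, TimeUnit.SECONDS)) {
            System.out.println(suggester.suggest("product", 10));
            indexed.add("product 2");
            // The new term is not visible until the next rebuild.
            System.out.println(suggester.suggest("product", 10));
            suggester.rebuild(); // stands in for the next scheduled run
            System.out.println(suggester.suggest("product", 10));
        }
    }
}
```

The trade-off is the one already mentioned: suggestions can be stale for up to one interval, and every rebuild is still a full rebuild.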

> Last, you can build the service as a "shard" level service, and not a
> node level service. It will simplify the management of building the suggest
> data structure.

Neat, I didn't know that. Is there a service that does exactly this that I
can take a look at?

Thanks for taking a look at it.
Also, thanks for the TransportAction tip - will keep my weekend
alive :wink:

--Alexander


(Shay Banon) #4

On Tue, Nov 29, 2011 at 6:09 PM, Alexander Reelsen <
alexander.reelsen@googlemail.com> wrote:

> Hi Shay
>
> On 29 Nov., 16:26, Shay Banon kim...@gmail.com wrote:
>
> > I had a quick look at part of the implementation. The first major thing
> > is that searchers are leaked. When you get a searcher from an index shard,
> > you need to release it once you are done with it.
>
> What do you mean by "release"? Not holding a reference to it, or
> performing a special action? (dumb question, I guess :slight_smile:)

When you get a searcher from an index shard, you get back Engine.Searcher,
which has a release method on it that you should call (in a finally clause).
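
A minimal sketch of that pattern, for illustration only. The Searcher and Shard types below are hypothetical stand-ins written for this example, not the real Elasticsearch classes; only the shape (acquire, use inside try, release in finally) is the point.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SearcherReleaseSketch {

    /** Hypothetical stand-in for Engine.Searcher: a resource that must be released after use. */
    interface Searcher {
        List<String> suggest(String term);
        void release();
    }

    /** Hypothetical stand-in for an index shard that hands out searchers and counts open ones. */
    static class Shard {
        final AtomicInteger openSearchers = new AtomicInteger();

        Searcher searcher() {
            openSearchers.incrementAndGet();
            return new Searcher() {
                @Override
                public List<String> suggest(String term) {
                    if (term.isEmpty()) {
                        throw new IllegalArgumentException("term must not be empty");
                    }
                    return Arrays.asList(term, term + "0");
                }

                @Override
                public void release() {
                    openSearchers.decrementAndGet();
                }
            };
        }
    }

    static List<String> suggest(Shard shard, String term) {
        Searcher searcher = shard.searcher();
        try {
            return searcher.suggest(term);
        } finally {
            // Runs on normal return and when suggest() throws, so the searcher is not leaked.
            searcher.release();
        }
    }

    public static void main(String[] args) {
        Shard shard = new Shard();
        System.out.println(suggest(shard, "product 1"));
        try {
            suggest(shard, "");
        } catch (IllegalArgumentException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
        System.out.println("open searchers: " + shard.openSearchers.get());
    }
}
```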

> > Second, the suggester is built on a top-level reader, which means it will
> > be rebuilt each time the index gets refreshed with a change. This can become
> > really heavy. You can enhance it to only build it periodically, but still,
> > for a large enough index, each change to be reflected requires a full rebuild
> > of the suggester. This is, btw, the main problem with the current suggester
> > in Lucene.
>
> True. I should implement some kind of periodic update, configurable I
> guess.
>
> > Last, you can build the service as a "shard" level service, and not a
> > node level service. It will simplify the management of building the suggest
> > data structure.
>
> Neat, I didn't know that. Is there a service that does exactly this that I
> can take a look at?

Check the ShardGetService, can be a good place to start.

