How to configure an index for auto-complete like Google does it


(Jondow) #1

I've done a lot of Googling on the subject, and read numerous posts and
examples on how to set up indexing and search for an auto-complete feature,
much like the behaviour you get with Google's search.

The examples I've found refer to using Edge NGram filters and multi_fields
with one being indexed, and the other not indexed.
Here is one example:
http://jontai.me/blog/2013/02/adding-autocomplete-to-an-elasticsearch-search-application/

These examples make sense but they only seem to work because of the fairly
atomic nature of the fields being indexed. Things like country_name or
author. These work fine because the entire field can be returned as a
reasonable suggestion in an auto-complete. However in my scenario, my
documents have a 'contents' field that has a long description of an item
potentially many paragraphs in length, and I want to be able to identify
documents that contain the entered term, as well as suggestions to
auto-complete based on the words following that in the document.

Alternatively, I even read an example that uses faceted search although
this didn't quite make sense to me as the results I got had no bearing on
the term being entered in the auto-complete box. This example was from the
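book ElasticSearch Server found here:
http://www.amazon.co.uk/ElasticSearch-Server-R-Kuc/dp/1849518440/ref=sr_1_1?ie=UTF8&qid=1377342391&sr=8-1&keywords=elasticsearch+server
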
Lastly I found an example using a shingle token filter, but I haven't been
able to get this one working as advertised:
http://developer.rackspace.com/blog/qbox.html

I guess the main issue I have is that the examples all work conveniently
because they target small document fields, not large fields with many
words/tokens. Has anybody been able to replicate the kind of functionality
that Google provides with its own auto-complete, which does give useful
suggestions to auto complete a search even if the word being typed occurs
in a large field? Could you point me in the right direction in terms of which
tokenizers/filters to use? I'm more than happy to do the digging to figure
it out, but I don't really know where to begin now, given the examples I've
tried so far.

Many thanks,
Darryl Pentz

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(ppearcy) #2

I may be wrong, but Google is likely using a list of bite-size indexed
fields. However, these are generated from the queries users are actually
searching for, and Google must keep track of how often each search phrase
occurs in order to sort the suggestions.

The best you might be able to do is key phrase extraction, but that is
definitely non-trivial, and you would need to use various
NER/linguistics tools (e.g. LingPipe, OpenCalais). YMMV.

Best Regards,
Paul



(Justin-2) #3

Have you checked out this plugin?

https://github.com/spinscale/elasticsearch-suggest-plugin

I would probably start from there.


(Alexander Reelsen) #4

Hey,

Sorry to chime in a bit late here, but you may want to check the completion
suggester and the accompanying blog post at
http://www.elasticsearch.org/blog/you-complete-me/

To be honest, I would prefer the completion suggester over the suggest
plugin. However your specific use case might work with the plugin and
shingle configuration better (I need to think through whether the completion
suggester can do what you need, but I don't have the time at
the moment).

--Alex



(Jondow) #5

Hi Alex,

Thanks for the response. My apologies, I should have responded to the
thread to indicate that I did get a solution working satisfactorily (for now
at least, unless I discover some flaw in it later in development and
testing).

I found that a shingle filter with a facet search did the trick for me. My
settings JSON is as follows:

"settings": {
    "index": {
        "analysis": {
            "analyzer": {
                "suggestion": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "suggestion_shingle"]
                }
            },
            "filter": {
                "suggestion_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 5
                },
                "filter_stop": {
                    "type": "stop",
                    "enable_position_increments":"false"
                }
            }
        }
    }
},
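
For readers unfamiliar with shingles, here is a rough sketch in plain Java of what the "suggestion" analyzer above roughly produces for a piece of text. It is not the Lucene implementation: the class and method names are made up for illustration, whitespace splitting stands in for the standard tokenizer, and the outputUnigrams flag mirrors what I understand to be the shingle filter's default of also emitting single tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Elasticsearch or Lucene.
final class ShingleSketch {

    // Returns word n-grams ("shingles") of minSize..maxSize tokens, joined by a
    // single space. Lowercasing mimics the "lowercase" filter; splitting on
    // whitespace is a simplification of the standard tokenizer.
    static List<String> shingles(String text, int minSize, int maxSize,
                                 boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            if (outputUnigrams) {
                out.add(tokens[start]);
            }
            // Emit every shingle that begins at this token and still fits.
            for (int size = minSize;
                 size <= maxSize && start + size <= tokens.length;
                 size++) {
                out.add(String.join(" ",
                        Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return out;
    }
}
```

Each of these terms ends up in the index for description.suggestion, which is why a terms facet on that field can return multi-word phrases as suggestions.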

My mapping uses a multi_field for the fields I want to query, and it looks
as follows:

            "description": {
                "type": "multi_field",
                "fields": {
                    "description": {
                        "type": "string",
                        "boost": "3.0"
                    },
                    "suggestion": {
                        "type": "string",
                        "boost": "3.0",
                        "index_analyzer": "suggestion",
                        "search_analyzer": "standard"
                    }
                }
            },

Then in my code I do the following search:

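    // Terms facet over the shingled sub-field; the regex keeps only those
    // shingles that start with what the user has typed so far.
    TermsFacetBuilder facetBuilder = new TermsFacetBuilder("desc");
    facetBuilder.field("description.suggestion")
            .regex("^${suffix}.*")
            .size(100);
    request.addFacet(facetBuilder)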

'suffix' above is simply whatever the user has typed in. However, if they
have typed multiple words like "foo ba", so they're busy typing 'bar', then
I put 'foo' as my search query string, and 'ba' becomes the suffix. If
they're still busy typing 'foo', then I do a match all query, with 'fo'
as the suffix in my facet regex.
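
The splitting described above could be sketched roughly as follows. This is an illustration rather than the code from my application: the SuggestInput class and its methods are hypothetical names, the handling of a trailing space (empty prefix) is an assumption, and lowercasing the prefix assumes the indexed shingles were lowercased as in the analyzer settings. Quoting the prefix with Pattern.quote assumes the facet regex is evaluated with java.util.regex, which is worth verifying against your ES version.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SuggestInput {
    final String query;   // completed words; empty means "use a match all query"
    final String prefix;  // partially typed last word ('suffix' in the snippet above)

    private SuggestInput(String query, String prefix) {
        this.query = query;
        this.prefix = prefix;
    }

    // Splits the typed text at the last whitespace character:
    // "foo ba" -> query "foo", prefix "ba"; "fo" -> query "", prefix "fo".
    static SuggestInput parse(String typed) {
        String s = typed.replaceAll("^\\s+", "");
        int lastSpace = -1;
        for (int i = s.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(s.charAt(i))) {
                lastSpace = i;
                break;
            }
        }
        String query = lastSpace < 0 ? "" : s.substring(0, lastSpace).trim();
        String prefix = s.substring(lastSpace + 1).toLowerCase(Locale.ROOT);
        return new SuggestInput(query, prefix);
    }

    boolean matchAll() {
        return query.isEmpty();
    }

    // Regex for the terms facet; the prefix is quoted so that characters
    // such as '.' or '(' typed by the user are matched literally.
    String facetRegex() {
        return "^" + Pattern.quote(prefix) + ".*";
    }
}
```

The query part then goes into the normal search query (or a match all query when it is empty), and facetRegex() is what would be passed to the facet builder's regex(...) call.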

So far this is working great. The only issue I am having is that, with the
latest version of ES, you're not allowed to set enable_position_increments
to false on the stop filter. It throws a stack trace during indexing. If I
don't set that to false, the shingle filter inserts placeholders (usually a
'_' character) where the stop words were. So I've had to leave the stop
filter out of the analyzer for now, and I still get stop words in my
results, which is fine for now. I don't have a solution for the stop filter
issue, though perhaps that'll be addressed sometime soon? But my current
solution is definitely workable and I'm very happy with its behaviour and
performance.

Regards,
Darryl Pentz


