Autocomplete of single words


(Ernesto Ruge) #1

Hi, everyone,

I have the following problem: I want autocompletion of single words. That is, if I type "bor", I want to get back "boring (num: 29)" and "border (num: 10)". "border" and "boring" occur in several large texts. I do not want the whole text, just the terms "boring" and "border" and the number of documents in which each occurs.

I think I have to use https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html for this, but all the examples return full-text results. So how do I get the fragments I want?

Thanks and best,
Ernesto


(Doug Turnbull) #2

Do you want a list of snippets where it occurs? Or do you want a breakdown of all the terms in the index that start with bor by count?

The former sounds like a highlighting problem. The latter sounds like a terms aggregation with a prefix filter. I've used the latter for single-term autocomplete before. It can work, depending on the size of your index and term dictionary (i.e. the number of unique terms).


(Ernesto Ruge) #3

I want a list of all the terms in the index starting with "bor".
Given these 6 documents:
"blah is blubb boring"
"blah is blah"
"boring is border booooo"
"narf is border"
"border is boring"
"blah is border blaaaaaah"
when I search for "bor", I want boring: 3, border: 4 as the result. My main problem is how to get the full terms back.


(David Kemp) #4

Using a terms agg with filtering may satisfy what you need. Note that you need to put the prefix text in both the query and in the agg filter:

{
  "query": {
    "match_phrase_prefix": {
      "name": "bor"
    }
  },
  "size": 0, 
  "aggs": {
    "myagg": {
      "terms": {
        "field": "name",
        "include": "bor.*", 
        "size": 0
      }
    }
  }
}

See this for more details:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values

Here is a working shell script (works for Elasticsearch 1.7.2):

curl -XDELETE "http://localhost:9200/foo"

curl -XPUT "http://localhost:9200/foo" -d'
{
  "mappings": {
    "foo": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/foo/foo/_bulk" -d'
{"index":{}}
{ "name" : "blah is blubb boring" }
{"index":{}}
{ "name" : "blah is blah" }
{"index":{}}
{ "name" : "boring is border booooo" }
{"index":{}}
{ "name" : "narf is border" }
{"index":{}}
{ "name" : "border is boring" }
{"index":{}}
{ "name" : "blah is border blaaaaaah" }
'

curl -XGET "http://localhost:9200/foo/_refresh"

echo

curl -XGET "http://localhost:9200/foo/foo/_search?pretty=true" -d'
{
  "query": {
    "match_phrase_prefix": {
      "name": "bor"
    }
  },
  "size": 0, 
  "aggs": {
    "myagg": {
      "terms": {
        "field": "name",
        "include": "bor.*", 
        "size": 0
      }
    }
  }
}'

This gives:

  "aggregations" : {
    "myagg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "border",
        "doc_count" : 4
      }, {
        "key" : "boring",
        "doc_count" : 3
      } ]
    }
  }
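
As a cross-check that needs no running cluster, the same per-term document counts can be reproduced with plain awk. This is only a sketch: the `prefix_counts` helper name is made up for illustration, and splitting on whitespace plus lowercasing is a rough stand-in for the analyzer, not what Elasticsearch does internally.

```shell
#!/bin/sh
# The six sample documents from post #3, one per line.
docs='blah is blubb boring
blah is blah
boring is border booooo
narf is border
border is boring
blah is border blaaaaaah'

# prefix_counts PREFIX  (hypothetical helper, not part of Elasticsearch)
# Reads one document per line on stdin and prints "term: doc_count" for
# every term starting with PREFIX. A term is counted once per document,
# mirroring doc_count in a terms aggregation.
prefix_counts() {
  awk -v p="$1" '{
    split("", seen)                      # reset the per-document set
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      if (index(w, p) == 1 && !(w in seen)) {
        seen[w] = 1
        count[w]++
      }
    }
  }
  END { for (w in count) print w ": " count[w] }' | sort
}

printf '%s\n' "$docs" | prefix_counts bor
# border: 4
# boring: 3
```

The output matches the buckets above, apart from ordering: the aggregation sorts by doc_count, while this sketch sorts alphabetically.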

(David Kemp) #5

As a follow-up: using edge ngrams is likely to be faster than "match_phrase_prefix".
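
A rough sketch of what that could look like, in the 1.x style of the script above. It has not been run against a cluster. The names `prefix_filter`, `prefix_index` and the `prefix` subfield are made up for illustration, and the gram sizes are arbitrary. `index_analyzer` and `edgeNGram` are the 1.x spellings; later versions use `analyzer` and `edge_ngram`, so check the docs for your version.

```shell
# ASSUMPTION: Elasticsearch 1.x on localhost:9200, index "foo" as in the script above.
# All analyzer, filter and subfield names below are hypothetical.
curl -XPUT "http://localhost:9200/foo" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "prefix_filter": { "type": "edgeNGram", "min_gram": 1, "max_gram": 20 }
      },
      "analyzer": {
        "prefix_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "prefix_filter"]
        }
      }
    }
  },
  "mappings": {
    "foo": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "prefix": {
              "type": "string",
              "index_analyzer": "prefix_index",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}'

# Query the ngram subfield, but aggregate on the plain field.
curl -XGET "http://localhost:9200/foo/foo/_search?pretty=true" -d'
{
  "query": { "match": { "name.prefix": "bor" } },
  "size": 0,
  "aggs": {
    "myagg": {
      "terms": { "field": "name", "include": "bor.*", "size": 0 }
    }
  }
}'
```

The aggregation stays on "name" on purpose: aggregating on the ngram subfield would return the fragments ("b", "bo", "bor", ...) rather than the full terms.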


(Ernesto Ruge) #6

Late reply, but: thanks! After some failures caused by bad list definitions in my model, it works now, and it's really fast, even with match_phrase_prefix. :) But I think I'll test ngrams soon.

