Autocomplete of single words

cardea · September 29, 2015, 10:39am

Hi, everyone,

Here is my problem: I want autocompletion of single words. That is, if I type "bor", I want to get back "boring (num: 29)" and "border (num: 10)". "border" and "boring" occur in several large texts. I do not want the whole text, just the number of documents in which "boring" and "border" occur and, of course, the terms "boring" and "border" themselves.

I think I have to use https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html to get this. But all the examples are about full-text results. So how do I get the fragments I want?

Thanks and best,
Ernesto

softwaredoug · September 29, 2015, 4:16pm

Do you want a list of snippets where it occurs? Or do you want a breakdown of all the terms in the index that start with bor by count?

The former sounds like a highlighting problem. The latter sounds like a terms aggregation with a prefix filter. I've used the latter for single-term autocomplete before. It can work, depending on the size of your index and term dictionary (i.e. the number of unique terms).

cardea · September 29, 2015, 5:07pm

I want a list of all the terms in the index starting with bor.
When I have these 6 datasets:
"blah is blubb boring"
"blah is blah"
"boring is border booooo"
"narf is border"
"border is boring"
"blah is border blaaaaaah"
and I search for "bor", I want boring: 3, border: 4 as the result. My main problem is how to get the full terms.
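
To pin down the expected counts independently of Elasticsearch, here is a minimal Python sketch of the counting described above. It assumes simple lowercase, whitespace-separated tokens, and the function name prefix_doc_counts is made up for illustration. Each term is counted once per document, however often it appears in that document.

```python
from collections import Counter


def prefix_doc_counts(docs, prefix):
    """Count, for each term starting with `prefix`, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        # Use a set so a term repeated inside one document is counted once.
        terms = set(doc.lower().split())
        counts.update(t for t in terms if t.startswith(prefix))
    return counts


docs = [
    "blah is blubb boring",
    "blah is blah",
    "boring is border booooo",
    "narf is border",
    "border is boring",
    "blah is border blaaaaaah",
]

counts = prefix_doc_counts(docs, "bor")
for term, n in counts.most_common():
    print(f"{term}: {n}")
```

Note that "booooo" is not counted, because it starts with "boo" rather than "bor".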

davidbkemp · October 2, 2015, 5:44am

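Using a terms agg with filtering may satisfy what you need. Note that you need to put the prefix text in both the query and the agg filter: the query restricts the aggregation to documents that contain a matching term, and the "include" pattern restricts the buckets to terms that start with the prefix. Without the "include" pattern, every other term in the matching documents (such as "is" or "blah") would get a bucket too.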

{
  "query": {
    "match_phrase_prefix": {
      "name": "bor"
    }
  },
  "size": 0, 
  "aggs": {
    "myagg": {
      "terms": {
        "field": "name",
        "include": "bor.*", 
        "size": 0
      }
    }
  }
}

See this for more details:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values

Here is a working shell script (works for Elasticsearch 1.7.2):

curl -XDELETE "http://localhost:9200/foo"

curl -XPUT "http://localhost:9200/foo" -d'
{
  "mappings": {
    "foo": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/foo/foo/_bulk" -d'
{"index":{}}
{ "name" : "blah is blubb boring" }
{"index":{}}
{ "name" : "blah is blah" }
{"index":{}}
{ "name" : "boring is border booooo" }
{"index":{}}
{ "name" : "narf is border" }
{"index":{}}
{ "name" : "border is boring" }
{"index":{}}
{ "name" : "blah is border blaaaaaah" }
'

curl -XGET "http://localhost:9200/foo/_refresh"

echo

curl -XGET "http://localhost:9200/foo/foo/_search?pretty=true" -d'
{
  "query": {
    "match_phrase_prefix": {
      "name": "bor"
    }
  },
  "size": 0, 
  "aggs": {
    "myagg": {
      "terms": {
        "field": "name",
        "include": "bor.*", 
        "size": 0
      }
    }
  }
}'

This gives:

"aggregations" : {
    "myagg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "border",
        "doc_count" : 4
      }, {
        "key" : "boring",
        "doc_count" : 3
      } ]
    }
  }
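
To turn those buckets into the "term (num: N)" strings asked for at the start of the thread, something like the following Python sketch could be applied on the client side. It assumes the response has already been fetched; the JSON below is just the excerpt above, wrapped in braces so it parses, and format_suggestions is a made-up helper name.

```python
import json

# The aggregation excerpt from above, wrapped so that it is valid JSON.
response_text = """
{
  "aggregations": {
    "myagg": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "border", "doc_count": 4},
        {"key": "boring", "doc_count": 3}
      ]
    }
  }
}
"""


def format_suggestions(response):
    """Build one 'term (num: N)' string per bucket, keeping the bucket order."""
    buckets = response["aggregations"]["myagg"]["buckets"]
    return [f'{b["key"]} (num: {b["doc_count"]})' for b in buckets]


suggestions = format_suggestions(json.loads(response_text))
print(suggestions)
```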

davidbkemp · October 2, 2015, 5:46am

As a follow-up: using edge ngrams is likely to be faster than "match_phrase_prefix".
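
The reason is that edge ngrams move the prefix work to index time: every leading substring of a term is indexed, so a prefix lookup becomes an exact term match. A rough Python sketch of the idea follows. It is a toy model, not Elasticsearch code; edge_ngrams and the dictionary index are made up for illustration, and min_gram / max_gram simply mirror the names of the corresponding Elasticsearch settings.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading substrings of `token` with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Toy inverted index: edge n-gram -> set of full terms that start with it.
index = {}
for term in ["border", "boring", "blah", "booooo"]:
    for gram in edge_ngrams(term, min_gram=2, max_gram=5):
        index.setdefault(gram, set()).add(term)

# A prefix search is now a single exact lookup instead of a term expansion.
matches = sorted(index.get("bor", set()))
print(matches)
```

The trade-off is a larger index, since each term is stored once per prefix length between min_gram and max_gram.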

cardea · October 15, 2015, 12:57pm

Late reply, but: thanks! After some failures caused by bad list definitions in my model, it works now, and it's really fast, even with match_phrase_prefix. But I think I'll test ngrams soon.