Search_as_you_type usage

CSharpBender · May 7, 2020, 10:03am

Hello,

I'm trying to make a search that orders the results by:
-exact match
-starts with
-phrase match
-exact words
-fuzzy

I'm using Elasticsearch 7.6.1. After trying to build my own query, I noticed that there is a search_as_you_type field type, but no documentation on how to use it.
First of all, what is in the hidden subfields that search_as_you_type creates? Is there a way to see how it tokenizes my text?

This is an example of what I'm trying to achieve:

PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

#Insert some song titles
PUT test/_doc/1?refresh
{
  "title": "Dream"
}
#Dream
#Just A Dream
#Dreamer
#Balladream
#Dress Blues
#Children at play
#You Are My Dream
#Calling Dr. Dre
#Still Dre

#Search for 'Dre'
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "dre",
      "type": "bool_prefix",
      "fuzziness": 2,
      "fields": [
        "title",
        "title._2gram",
        "title._3gram",
        "title._index_prefix"
      ]
    }
  }
}

All results have the same score (1.0), and the search does not handle "contains" matches.
So if I search for "dream", the "Balladream" result is not returned, and the order is a mess.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "title" : "Just A Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "title" : "You Are My Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 1.0,
        "_source" : {
          "title" : "Dreamer"
        }
      }
    ]
  }
}

Any idea on how I can make my search?

cbuescher · May 7, 2020, 12:10pm

Hi,

One possible way to check this is to use the _termvectors API, which lists the indexed terms for a specific document. For the "Just A Dream" doc, for example, I get:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "my_field._3gram" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 4,
        "sum_ttf" : 5
      },
      "terms" : {
        "just a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        }
      }
    },
    "my_field._2gram" : {
      "field_statistics" : {
        "sum_doc_freq" : 11,
        "doc_count" : 6,
        "sum_ttf" : 11
      },
      "terms" : {
        "a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "just a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 6
            }
          ]
        }
      }
    },
    "my_field._index_prefix" : {
      "field_statistics" : {
        "sum_doc_freq" : 181,
        "doc_count" : 9,
        "sum_ttf" : 183
      },
      "terms" : {
        "a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dream " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream  " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "j" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "ju" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "jus" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        }
      }
    },
    "my_field" : {
      "field_statistics" : {
        "sum_doc_freq" : 20,
        "doc_count" : 9,
        "sum_ttf" : 20
      },
      "terms" : {
        "a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 6
            }
          ]
        },
        "dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "just" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        }
      }
    }
  }
}
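
For reference, a response like the one above could come from a request along these lines. This is a sketch, not necessarily the exact call that was used: the index name my_index, field name my_field and document ID 2 are taken from the output, while the choice of body options is an assumption.

```console
# Ask for the terms of the main field and of the three subfields
# that search_as_you_type creates for it.
GET my_index/_termvectors/2
{
  "fields": [
    "my_field",
    "my_field._2gram",
    "my_field._3gram",
    "my_field._index_prefix"
  ],
  "positions": true,
  "offsets": true,
  "field_statistics": true,
  "term_statistics": false
}
```

The _2gram and _3gram subfields hold word shingles, and _index_prefix holds edge n-grams of the words and shingles, which matches the terms listed in the output.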

To include exact matches, for example, and rank them higher, you can explore ways of combining several queries in the should clause of a bool query, or something similar. Getting exactly the ranking you proposed above might be a bit fiddly, but this query, for example, scores exact "dre" matches higher than documents that only match the prefix:

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "dre",
            "type": "bool_prefix",
            "fuzziness": 2, 
            "fields": [
              "my_field",
              "my_field._2gram",
              "my_field._3gram",
              "my_field._index_prefix"
            ]
          }
        },
        {
          "match_phrase": {
            "my_field": "dre"
          }
        }
      ]
    }
  }
}
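
The same idea can be pushed further towards the ordering asked for in the question. The following is only a sketch under stated assumptions: the index name test_ranked, the extra keyword field title_exact (filled by the client with the same value as title), the normalizer name and all boost values are hypothetical and would need tuning. Two details worth knowing: with a single-word query such as "dre", bool_prefix builds only a prefix query, which is constant-scoring (hence the identical score of 1.0 in the question), and the fuzziness parameter has no effect on that final prefix term.

```console
# Hypothetical index: "title" for as-you-type matching,
# "title_exact" as a lowercased keyword for whole-title matching.
PUT test_ranked
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": { "type": "custom", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type" },
      "title_exact": { "type": "keyword", "normalizer": "lowercase_normalizer" }
    }
  }
}

PUT test_ranked/_doc/1?refresh
{
  "title": "Dream",
  "title_exact": "Dream"
}

# Each should clause adds to the score, so a document that matches
# more (and higher-boosted) clauses ranks first.
GET test_ranked/_search
{
  "query": {
    "bool": {
      "should": [
        { "term":   { "title_exact": { "value": "dre", "boost": 10 } } },
        { "prefix": { "title_exact": { "value": "dre", "boost": 5 } } },
        { "match_phrase": { "title": { "query": "dre", "boost": 3 } } },
        {
          "multi_match": {
            "query": "dre",
            "type": "bool_prefix",
            "fields": ["title", "title._2gram", "title._3gram"]
          }
        },
        { "match": { "title": { "query": "dre", "fuzziness": "AUTO", "boost": 0.5 } } }
      ]
    }
  }
}
```

The clauses correspond, in order, to exact title, title starts with, phrase match, as-you-type prefix match and fuzzy word match.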

As for "infix" matches, like returning "Balladream" for "dream", I'm not entirely sure. The documentation mentions something along those lines but is short on details. I will do some digging, and I hope the pointers above bring you a bit closer to your goal in the meantime.

cbuescher · May 7, 2020, 1:13pm

Regarding your question about returning results that partially match inside a token: unfortunately everything is term-based here. You can return "foo bar" if you search for "ba", but an infix match at the term level is currently not possible, since we only index prefixes. It would also be a very costly operation given how the data structures work internally. One workaround might be to look at decompounders to split "dream" off from "Balladream", but I'm not sure that works in a scalable and reliable way. Sorry for that, but I hope the rest helps.
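
A different technique from the decompounder idea, and not something search_as_you_type offers, is to index character n-grams of every word so that a query word can match inside a longer token. A minimal sketch, assuming a separate hypothetical index test_infix with made-up analyzer and filter names; it trades a noticeably larger index for "contains" matching, and query words longer than max_gram will not match:

```console
PUT test_infix
{
  "settings": {
    "index": { "max_ngram_diff": 7 },
    "analysis": {
      "filter": {
        "infix_ngrams": { "type": "ngram", "min_gram": 3, "max_gram": 10 }
      },
      "analyzer": {
        "infix_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "infix_ngrams"]
        },
        "infix_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "infix_index",
        "search_analyzer": "infix_search"
      }
    }
  }
}

PUT test_infix/_doc/1?refresh
{
  "title": "Balladream"
}

# "dream" is one of the n-grams indexed for "balladream".
GET test_infix/_search
{
  "query": {
    "match": { "title": "dream" }
  }
}
```

The index.max_ngram_diff setting has to cover the difference between max_gram and min_gram (10 - 3 = 7 here); without it the default limit rejects the filter definition.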

CSharpBender · May 9, 2020, 11:40am

Hi @cbuescher, thanks for replying.
I was expecting more from this new type.
Currently I'm filtering with a match_phrase query with slop 2 over a field analyzed with the standard tokenizer plus edge_ngram and lowercase filters, combined with a should clause that has 3 conditions with different boosts: exact expression match (lowercase keyword), starts with (lowercase keyword) and match on words (standard tokenizer + lowercase filter).
The only issues with my approach are that it's a little slow (~120 ms for 100 returned items out of 1 million records) and that it doesn't address "contains" matches.

Off topic: I have a typeahead control that searches through the values of one document field, and those values contain duplicates (e.g. artist name). I tried to execute the above query, apply a distinct aggregation on the results and order them by max score, but it doubled my execution time; it's faster if I do it in SQL Server.
Maybe I'm doing it wrong and I shouldn't use a script:

"aggs": {
  "group_by_title": {
    "terms": {
      "field": "title",
      "order": {
        "max_score": "desc"
      }
    },
    "aggs": {
      "max_score": {
        "max": {
          "script": "doc.score"
        }
      }
    }
  }
}

I think it would be better to have another index with those distinct values and query those few documents. Something like: _id, columnName, columnValue, where columnName is the document field I want to search on.
Is there a way to automatically extract a field's distinct values? I was looking at the transform API, but it's unclear how I should insert/update/delete distinct values when a document changes in another index.
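
On the aggregation above: inside an aggregation script the relevance score is exposed as _score, and a terms aggregation generally needs a keyword field rather than an analyzed one. A sketch of both points, assuming a hypothetical keyword field title_exact that holds the same value as title, plus field collapsing as a possibly cheaper alternative (neither variant has been measured against the data described here):

```console
# Distinct titles ordered by their best score, via a max sub-aggregation.
GET test/_search
{
  "size": 0,
  "query": {
    "match": { "title": "dre" }
  },
  "aggs": {
    "group_by_title": {
      "terms": {
        "field": "title_exact",
        "size": 10,
        "order": { "max_score": "desc" }
      },
      "aggs": {
        "max_score": {
          "max": { "script": { "source": "_score" } }
        }
      }
    }
  }
}

# Alternative: keep only the best-scoring hit per distinct title.
GET test/_search
{
  "size": 10,
  "query": {
    "match": { "title": "dre" }
  },
  "collapse": { "field": "title_exact" }
}
```

Collapsing returns regular hits sorted by score, so it avoids running a script for every matching document.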

system · June 6, 2020, 11:40am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
