Search_as_you_type usage

Hello,

I'm trying to make a search that orders the results by:
-exact match
-starts with
-phrase match
-exact words
-fuzzy

I'm using Elasticsearch 7.6.1. After trying to build my own query, I noticed that there is a search_as_you_type field type, but I couldn't find much documentation on how to use it.
First of all, I would like to know what is in the hidden subfields that search_as_you_type creates. Is there a way to see how it tokenizes my text?

This is an example of what I'm trying to achieve:

PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

#Insert some song titles
PUT test/_doc/1?refresh
{
  "title": "Dream"
}
#Dream
#Just A Dream
#Dreamer
#Balladream
#Dress Blues
#Children at play
#You Are My Dream
#Calling Dr. Dre
#Still Dre

#Search for 'Dre'
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "dre",
      "type": "bool_prefix",
      "fuzziness": 2,
      "fields": [
        "title",
        "title._2gram",
        "title._3gram",
        "title._index_prefix"
      ]
    }
  }
}

All results come back with the same score (1.0), and the search does not work as a "contains" match.
For example, if I search for "dream", the "Balladream" result is not returned, and the order is a mess.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "title" : "Just A Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "title" : "You Are My Dream"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "9",
        "_score" : 1.0,
        "_source" : {
          "title" : "Dreamer"
        }      
      }
    ]
  }
}

Any idea how I can build this search?

Hi,

One way to check this is the _termvectors API, which shows the indexed terms for a specific document. For the "Just A Dream" doc, for example, I get:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "found" : true,
  "took" : 1,
  "term_vectors" : {
    "my_field._3gram" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 4,
        "sum_ttf" : 5
      },
      "terms" : {
        "just a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        }
      }
    },
    "my_field._2gram" : {
      "field_statistics" : {
        "sum_doc_freq" : 11,
        "doc_count" : 6,
        "sum_ttf" : 11
      },
      "terms" : {
        "a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "just a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 6
            }
          ]
        }
      }
    },
    "my_field._index_prefix" : {
      "field_statistics" : {
        "sum_doc_freq" : 181,
        "doc_count" : 9,
        "sum_ttf" : 183
      },
      "terms" : {
        "a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "a dream " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 12
            }
          ]
        },
        "d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "dream  " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "j" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "ju" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "jus" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a " : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a d" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dr" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dre" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a drea" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        },
        "just a dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 12
            }
          ]
        }
      }
    },
    "my_field" : {
      "field_statistics" : {
        "sum_doc_freq" : 20,
        "doc_count" : 9,
        "sum_ttf" : 20
      },
      "terms" : {
        "a" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 5,
              "end_offset" : 6
            }
          ]
        },
        "dream" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 7,
              "end_offset" : 12
            }
          ]
        },
        "just" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        }
      }
    }
  }
}
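
As a rough way to read the `_index_prefix` terms above, here is a small Python sketch that reproduces them for "Just A Dream". It is only a model inferred from this output, not the actual analyzer: it assumes lowercasing plus whitespace splitting in place of the real tokenizer, shingles of up to three tokens padded with empty filler tokens, and it ignores any limit on prefix length. The function name is made up for illustration.

```python
def index_prefix_terms(text, max_shingle_size=3):
    """Approximate the terms listed under `_index_prefix` in the output above."""
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    tokens = text.lower().split()
    terms = set()
    for i in range(len(tokens)):
        # Take up to `max_shingle_size` tokens starting at position i and pad
        # missing positions with empty filler tokens. The padding produces the
        # trailing spaces visible in terms such as "dream  ".
        window = tokens[i:i + max_shingle_size]
        window += [""] * (max_shingle_size - len(window))
        shingle = " ".join(window)
        # Every leading substring (edge n-gram) of the shingle becomes a term.
        for end in range(1, len(shingle) + 1):
            terms.add(shingle[:end])
    return terms


terms = index_prefix_terms("Just A Dream")
print(len(terms))       # 27 distinct terms, the same count as in the output above
print("dre" in terms)   # True: "dre" is a leading substring of "dream  "
print("ream" in terms)  # False: no term starts in the middle of a token
```

The last two lines hint at why a query for "dre" finds "Just A Dream" while a fragment from the middle of a word finds nothing.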

To include, for example, exact matches and rank them higher, you can explore combining several queries in the should clause of a bool query, or something similar. Getting exactly the ranking you proposed above might be a bit fiddly, but this query, for example, scores exact "dre" matches higher than prefix-only matches:

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "dre",
            "type": "bool_prefix",
            "fuzziness": 2, 
            "fields": [
              "my_field",
              "my_field._2gram",
              "my_field._3gram",
              "my_field._index_prefix"
            ]
          }
        },
        {
          "match_phrase": {
            "my_field": "dre"
          }
        }
      ]
    }
  }
}

About the possibility of "infix" matches, like returning "Balladream" for "dream": I'm not entirely sure. The documentation mentions something along those lines but is short on details. I will do some digging; in the meantime I hope the pointers above bring you a bit closer to your goal.

Regarding your question about returning results that partially match inside a token: unfortunately everything is term-based here, so you can return "foo bar" if you search for "ba", but an infix match at the term level is currently not possible because we only index prefixes. It would also be a very costly operation given how the data structures work internally. One workaround might be to look at decompounders to split "dream" off from "Balladream", but I'm not sure whether that works in a scalable and reliable way. Sorry about that, but I hope the rest helps.
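
To illustrate the cost argument, here is a minimal Python sketch using the song titles from the first post. It models the term dictionary as a plain sorted list of lowercase tokens, which is a deliberate simplification and not how Lucene stores terms; the two function names are made up. A prefix lookup can jump to the right place and read neighbouring terms, while an infix lookup has to inspect every term.

```python
from bisect import bisect_left

titles = ["Dream", "Just A Dream", "Dreamer", "Balladream", "Dress Blues",
          "Children at play", "You Are My Dream", "Calling Dr. Dre", "Still Dre"]

# Simplified stand-in for a term dictionary: sorted, unique, lowercase tokens.
terms = sorted({tok.strip(".") for t in titles for tok in t.lower().split()})


def prefix_lookup(sorted_terms, prefix):
    """Binary-search to the first candidate, then read on while the prefix matches."""
    out = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out


def infix_lookup(sorted_terms, fragment):
    """Sorting does not help here: every term has to be checked."""
    return [t for t in sorted_terms if fragment in t]


print(prefix_lookup(terms, "dream"))  # ['dream', 'dreamer']
print(infix_lookup(terms, "dream"))   # ['balladream', 'dream', 'dreamer']
```

With nine titles the difference is invisible, but the full scan in `infix_lookup` grows with the number of distinct terms in the index.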

Hi @cbuescher, thanks for replying.
I was expecting more from this new type.
Currently I'm using a match_phrase query with slop 2 over a field analyzed with a standard tokenizer plus edge_ngram and lowercase filters for filtering, and a should with 3 conditions with different boosts: exact expression match (lowercase keyword), starts with (lowercase keyword), and match on words (standard tokenizer + lowercase filter).
The only issues with my approach are that it's a little slow (~120 ms for 100 returned items from 1 million records) and that it doesn't address the "contains" case.
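
For reference, the shape of that query can be sketched in Python as a function that builds the request body. This is only an outline under assumptions: the subfield names (`title.exact`, `title.edge`, `title.words`) are hypothetical and do not exist in the mapping from the first post, the boost values are arbitrary placeholders, and putting the match_phrase clause under `filter` is one possible reading of "for filtering".

```python
import json


def build_tiered_query(text, field="title"):
    """Build a bool query body: one filter clause plus three boosted should clauses."""
    # Hypothetical subfields; none of them exist in the mapping shown earlier.
    exact_field = f"{field}.exact"   # lowercased keyword
    edge_field = f"{field}.edge"     # standard tokenizer + lowercase + edge_ngram
    words_field = f"{field}.words"   # standard tokenizer + lowercase
    needle = text.lower()
    return {
        "query": {
            "bool": {
                # Decides which documents match at all; does not affect the score.
                "filter": [
                    {"match_phrase": {edge_field: {"query": text, "slop": 2}}}
                ],
                # Each optional clause adds to the score when it matches.
                # The boost values are placeholders, not tuned numbers.
                "should": [
                    {"term": {exact_field: {"value": needle, "boost": 100.0}}},
                    {"prefix": {exact_field: {"value": needle, "boost": 10.0}}},
                    {"match": {words_field: {"query": text, "boost": 2.0}}},
                ],
            }
        }
    }


print(json.dumps(build_tiered_query("Dre"), indent=2))
```

The printed JSON can be pasted as the body of a `_search` request once the field names are replaced with real ones.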

Off topic: I have a typeahead control that searches the values of one document field that contain duplicates (e.g. artist name). I tried to execute the above query, apply a distinct aggregation on the results, and order them by max score, but it doubled my execution time; it's faster if I do it from SQL Server.
Maybe I'm doing it wrong and I shouldn't use a script:

"aggs": {
  "group_by_title": {
    "terms": {
      "field": "title",
      "order": {
        "max_score": "desc"
      }
    },
    "aggs": {
      "max_score": {
        "max": {
          "script": "_score"
        }
      }
    }
  }
}

I think it would be better to have another index with those distinct values and query on those few documents. Something like: _id, columnName, columnValue; where columnName is the document field on which I want to search.
Is there a way to automatically extract the distinct values of a field? I was looking at the transform API, but it's unclear how I should insert/update/delete distinct values when a document changes in another index.
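
To make the idea concrete, here is a minimal Python sketch of how such lookup documents could be derived on the application side. Everything in it is an assumption for illustration: the function name, the sample records, and the choice of a SHA-1 hash of column name plus value as `_id`. It covers inserts and updates only; deleting a value would require knowing that no source document carries it any more, which this sketch does not track.

```python
import hashlib


def distinct_value_docs(records, column):
    """Return one lookup document per distinct value of `column`, keyed by a stable id."""
    docs = {}
    for record in records:
        value = record.get(column)
        if value is None:
            continue
        # Deterministic id: indexing the same (column, value) pair again would
        # overwrite the existing document instead of creating a duplicate.
        doc_id = hashlib.sha1(f"{column}\x00{value}".encode("utf-8")).hexdigest()
        docs[doc_id] = {"columnName": column, "columnValue": value}
    return docs


# Made-up sample records; the artist names are placeholders.
songs = [
    {"title": "Dream", "artist": "Artist A"},
    {"title": "Dreamer", "artist": "Artist A"},
    {"title": "Dress Blues", "artist": "Artist B"},
]
lookup = distinct_value_docs(songs, "artist")
print(len(lookup))  # 2
for doc_id, doc in lookup.items():
    print(doc_id, doc)
```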