Multi_match with phrase_prefix is not working although a token has the prefix in it

Hello,

I am seeing strange behavior with multi_match.
This is a simplified version of the query I am running:

{
    "multi_match": {
        "query": "202201",
        "type": "phrase_prefix",
        "fields": [
          "file_name"
        ]
      }
}

If I query with 202201, I see 20220101_Legal Document_5678.pdf in the results, but if I query with 2022, the file no longer appears.

So I analyzed the file_name 20220101_Legal Document_5678.pdf, and here are the tokens:

{
  "tokens": [
    {
      "token": "20220101_legal",
      "start_offset": 0,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "document_5678",
      "start_offset": 15,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "pdf",
      "start_offset": 29,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

As you can see, the first token starts with 2022, so it should match a query for 2022, shouldn't it?
But the results of the query for 2022 don't include that file. Can someone explain why?

Thank you for your help in advance!
Best,
Ansun

What version of Elasticsearch are you using?

In 8.14.1, I can't reproduce this, because the following snippet:

POST test/_doc
{
  "file_name": "20220101_Legal Document_5678.pdf"
}

POST test/_search
{
  "query": {
    "multi_match": {
      "query": "2022",
      "type": "phrase_prefix", 
      "fields": ["file_name"]
    }
  }
}

will return the expected result. Even 20 and 2 return the expected result.

It would help if you could provide more information: the mappings, the exact query, and the version, to start with. Also check whether the document you expect is still returned, just further down the result set.


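It's because of the max_expansions value, which defaults to 50. The last term of the query is expanded to at most that many index terms starting with the prefix, so if more than 50 terms in your index start with 2022, the term from your file may not be among them. A longer prefix (e.g. "202201") matches fewer terms, which is why you see the result then. You can increase the max_expansions value, but it can hurt performance.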

See my screenshot: both queries return the document because I have only one doc in the test index.

See the notes from official documentation: Match phrase prefix query | Elasticsearch Guide [8.14] | Elastic
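To make the limit concrete, here is a sketch of the original query with max_expansions raised. The value 100 is only illustrative and the index name test is taken from the snippet earlier in the thread; the right value depends on how many terms in your index share the prefix.

```console
POST test/_search
{
  "query": {
    "multi_match": {
      "query": "2022",
      "type": "phrase_prefix",
      "fields": ["file_name"],
      "max_expansions": 100
    }
  }
}
```

If the file still does not show up, more than 100 terms probably start with 2022, and raising the value further makes the query more expensive.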

As a workaround, you can choose one of the following:

  1. Increase the max_expansions value in your query.
  2. Use an edge_ngram tokenizer - this also speeds up the query. (recommended)
  3. Use a prefix query - there is no max_expansions limit for the prefix query, but it can be slower than phrase_prefix.
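For the edge_ngram option, a minimal sketch could look like the one below. It uses the edge_ngram token filter on top of the standard tokenizer rather than the edge_ngram tokenizer, so the tokens stay the same as in the _analyze output above. The index name files_prefix, the filter and analyzer names, and the min_gram / max_gram values are assumptions for illustration, not something taken from your setup.

```console
PUT files_prefix
{
  "settings": {
    "analysis": {
      "filter": {
        "file_name_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "file_name_prefix": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "file_name_edge"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_name": {
        "type": "text",
        "analyzer": "file_name_prefix",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Because every prefix of each token is stored at index time, a plain match query is enough at search time and no term expansion is involved:

```console
POST files_prefix/_search
{
  "query": {
    "match": {
      "file_name": "2022"
    }
  }
}
```

Existing documents have to be reindexed into the new index, and with these settings search terms longer than 20 characters will not match, so adjust max_gram to your longest expected prefix.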

Hi Kathleen,

Thank you for your reply!
I am using ES 7.13.2.

The mapping for file_name is

{
  "file_name": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

Hi Musab,

Thank you for your answer! Increasing max_expansions to 100 resolved the problem. But do you have any idea why a query for 2022 doesn't return the file 20220101_Legal Document_5678.pdf? From what I understood, the file name already matches the prefix 2022, so the query shouldn't even need to be expanded.

Or does ES expand the query and compare it to the whole token (20220101_legal)?

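Just to answer my own question above: Match phrase prefix query | Elasticsearch Guide [8.14] | Elastic already explains how it works. ES does not check the prefix against each document's tokens directly. It expands the last term of the query into the first max_expansions terms from the sorted term dictionary that start with that prefix, and then searches for those terms. So 20220101_legal has to be among the generated expansions for the file to match.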
