Best similarity for retrieving exact matches first

Giovanni_Montobbio · November 3, 2021, 11:25am

Hi, I'm working on a project with Elasticsearch. I have an index containing 1 million phrases, and I want to retrieve the phrases that match some query phrases. The phrases are in Italian, and I use an Italian analyzer to analyze them. Everything works fine except for the order (and the score) of the matches: ideally, exact matches of the query phrase should come first, but that's not happening.
For example:
searching in my index for phrases containing the words "film cortometraggio" the first match is:
Pappi Corsicato Ha diretto film , cortometraggi , documentari e videoclip.
And then there is the match:
Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;

In this case the first phrase contains the second word ("cortometraggio") in its plural form, while the second phrase contains an exact match, yet the similarity algorithm gives the first phrase a higher score.
I am using the default BM25 similarity and I also tried the boolean similarity, but the problem persists.
How can I change the similarity measure to get the matches in the correct order?

dadoonet · November 3, 2021, 1:38pm

You should combine multiple ways of searching as should clauses of a bool query.

For example, run a match_phrase and a match query together.
If the match_phrase query matches, it will contribute to the final score.
You can even boost it.

I have a full example here:

gist.github.com

https://gist.github.com/dadoonet/5179ee72ecbf08f12f53d4bda1b76bab

search_kibana_console.txt

### REINIT
DELETE user
PUT user
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "comments": {

This file has been truncated.

HTH
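
As a minimal sketch of the suggestion above (not the content of the truncated gist), the following builds a request body that pairs a boosted match_phrase clause with a plain match clause inside bool/should. It uses only the JDK and prints the JSON so it can be pasted into the Kibana console. The field name frase comes from later in this thread; the class name, the boost value of 2.0, and the lack of JSON escaping are assumptions made for illustration.

```java
class PhraseFirstQuery {

    // Builds a bool/should body: a document that matches the whole phrase
    // gets the boosted match_phrase score added to its plain match score.
    // Assumption: field and text contain no characters that need JSON escaping.
    static String build(String field, String text, double phraseBoost) {
        return "{\n"
            + "  \"query\": {\n"
            + "    \"bool\": {\n"
            + "      \"should\": [\n"
            + "        { \"match_phrase\": { \"" + field + "\": { \"query\": \"" + text
            + "\", \"boost\": " + phraseBoost + " } } },\n"
            + "        { \"match\": { \"" + field + "\": \"" + text + "\" } }\n"
            + "      ]\n"
            + "    }\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        // Hypothetical values: the boost of 2.0 is illustrative only.
        System.out.println(build("frase", "film cortometraggio", 2.0));
    }
}
```

Because should clauses add their scores together, a document matching both clauses should tend to rank above one matching only the match clause. The boost controls how strongly the phrase match is favoured.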

Giovanni_Montobbio · November 3, 2021, 2:06pm

Hi, I found a solution using this:

GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "frase": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "frase": "film cortometraggio"
          }
        }
      ]
    }
  }
}

This seems to work fine, but since I am using the Java High Level REST Client I have to translate the query DSL into Java, and I'm having trouble with the should clause (I don't know how to add the second match_phrase). Do you have any suggestions?
Thank you very much

dadoonet · November 3, 2021, 2:29pm

Please share your current code and we will start from there.

Giovanni_Montobbio · November 3, 2021, 3:23pm

I solved it in this way:

searchSourceBuilder.query(QueryBuilders.boolQuery()
  .should(QueryBuilders.matchPhraseQuery(field, text).slop(5))
  .should(QueryBuilders.matchPhraseQuery(field, text)));

But this only puts the exact matches first. I also want the results in this order:

1) the exact matches
2) phrases that match the input text (with slop etc.)
3) phrases that match the stemmed forms of the words

dadoonet · November 3, 2021, 4:01pm

So you have 3 conditions here.
But in your query you put only 2.

I believe you need to add one or more queries.

If you don't succeed, please provide a full recreation script as described in About the Elasticsearch category. It will help to better understand what you are doing. Please, try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste in Kibana dev console, click on the run button to reproduce your use case. It will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Giovanni_Montobbio · November 3, 2021, 4:20pm

I did not succeed. The problem is that this script works fine if I use the boolean similarity:

GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "frase": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "frase": "film cortometraggio"
          }
        }
      ]
    }
  }
}

But it works fine in this case and not in many others.
For example, if I'm searching for "bobina del film" I'd like to retrieve, in this order:

1) An exact match like:
La bobina contenente il film incompiuto e mai uscito nelle sale venne distrutta durante un bombardamento
2) A phrase that is not an exact match but matches the input text, like:
Il film,un cortometraggio in una bobina, fu distribuito dalla General Film Company e uscì in sala
3) A phrase that matches words resulting from stemming, like:
Il 30 maggio 2013 James Bobin viene scelto come regista del film, il cui titolo di lavorazione è "Alice"

These three cases must appear in this order, but I can't achieve it. I'm trying the should/match_phrase combination, but I don't know how to express these three different cases.

Because the boolean similarity does not work well for me I went back to BM25, but I still have similar problems.
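
One possible way to express the three tiers, sketched under assumptions: because the Italian analyzer stems both the indexed text and the query, a match_phrase on frase alone cannot tell "cortometraggio" from "cortometraggi". A common approach is to index the same text a second time in an unstemmed sub-field and give clauses on that sub-field a higher boost. The sub-field name frase.exact, the choice of the standard analyzer, the boost values, and the class name are all hypothetical and do not come from this thread. The code uses only the JDK and prints JSON bodies for the Kibana console.

```java
class TieredQuery {

    // Mapping sketch: "frase" keeps the Italian analyzer, while the hypothetical
    // sub-field "frase.exact" uses the standard analyzer, which does not stem.
    static String mapping() {
        return "{ \"mappings\": { \"properties\": { \"frase\": {"
            + " \"type\": \"text\", \"analyzer\": \"italian\","
            + " \"fields\": { \"exact\": { \"type\": \"text\", \"analyzer\": \"standard\" } }"
            + " } } } }";
    }

    // One match_phrase clause with an explicit slop and boost.
    // Assumption: field and text need no JSON escaping.
    static String phrase(String field, String text, int slop, double boost) {
        return "{ \"match_phrase\": { \"" + field + "\": { \"query\": \"" + text
            + "\", \"slop\": " + slop + ", \"boost\": " + boost + " } } }";
    }

    // Three should clauses, from most to least specific:
    // 1) exact phrase on the unstemmed sub-field, 2) sloppy phrase on the
    // stemmed field, 3) plain match on the stemmed field.
    static String query(String field, String text) {
        return "{ \"query\": { \"bool\": { \"should\": [ "
            + phrase(field + ".exact", text, 0, 3.0) + ", "
            + phrase(field, text, 5, 2.0) + ", "
            + "{ \"match\": { \"" + field + "\": \"" + text + "\" } }"
            + " ] } } }";
    }

    public static void main(String[] args) {
        System.out.println(mapping());                           // body for: PUT /research
        System.out.println(query("frase", "bobina del film"));   // body for: GET /research/_search
    }
}
```

Scores from should clauses are summed, so the boosts bias the ranking towards this order without strictly guaranteeing it. The existing index would also need to be recreated and reindexed for the sub-field to be populated. With the High Level REST Client, each clause would correspond to a QueryBuilders.matchPhraseQuery(...) or QueryBuilders.matchQuery(...) call passed to should(...), with slop(...) and boost(...) set on the builder.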

dadoonet · November 3, 2021, 7:04pm

Please invest some time in a full script. As I said:

A full reproduction script is something anyone can copy and paste in Kibana dev console, click on the run button to reproduce your use case.

system · December 1, 2021, 7:04pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
