Best similarity for retrieving exact matches first

Hi, I'm working on a project with Elasticsearch. I have an index of 1 million phrases and I want to retrieve the phrases that match some query phrases. The phrases are in Italian and I use an Italian analyzer to analyze them. Everything works fine, but the problem is the order (and the score) of the matches: ideally I want the exact matches of the query phrases to come first, but that's not happening.
For example:
when I search my index for phrases containing the words "film cortometraggio", the first match is:
Pappi Corsicato Ha diretto film , cortometraggi , documentari e videoclip.
And then there is the match:
Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;

In this case the first phrase contains the second word ("cortometraggio") in its plural form, while the second phrase contains an exact match, yet the similarity algorithm gives the first phrase a higher score.
I am using the default BM25 similarity and I also tried the boolean similarity, but that does not solve the problem.
How can I change the similarity measure to get the matches in the correct order?

You should combine multiple ways of searching as should clauses of a bool query.

For example, run a match_phrase and a match query together.
If the match_phrase query matches, it will contribute to the final score.
You can even boost it.
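
As a minimal sketch of that idea, assuming an index named `research` with a text field named `frase`: the `match` clause finds any document containing the terms, and the `match_phrase` clause adds to the score when the terms also occur as a phrase. The `boost` value of 2 is an arbitrary illustration, not a tuned number.

```
# Assumption: index "research" with a text field "frase".
# Assumption: the boost of 2 is arbitrary and only for illustration.
GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "frase": "film cortometraggio" } },
        { "match_phrase": { "frase": { "query": "film cortometraggio", "boost": 2 } } }
      ]
    }
  }
}
```

Because both clauses are under `should`, a document matching only the `match` clause is still returned; it simply scores lower than one that also matches the phrase.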

I have a full example here:

HTH


Hi, I found a solution using this:

GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "frase": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "frase": "film cortometraggio"
          }
        }
      ]
    }
  }
}

This seems to work fine, but since I am using the Java High Level REST Client I have to "translate" the DSL into Java, and I'm having trouble with the should clause (I don't know how to insert the second match_phrase). Do you have any suggestions?
Thank you very much

Please share your current code and we will start from there.

I solved it in this way:

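// Both clauses are optional (should); a document matching both gets a higher score.
searchSourceBuilder.query(QueryBuilders.boolQuery()
  // Phrase match that tolerates up to 5 position moves (slop).
  .should(QueryBuilders.matchPhraseQuery(field, text).slop(5))
  .should(QueryBuilders.matchPhraseQuery(field, text)));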

But this only puts the exact matches first. I also want, in order:

  1. the exact matches
  2. phrases that match the input text (with slop etc.)
  3. phrases that match the stemmed forms of the words

So you have 3 conditions here.
But in your query you put only 2.

I believe you need to add another query (or queries).
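
One way the three clauses could be sketched, with the assumptions stated up front: the Italian analyzer appears to reduce singular and plural forms to the same token, so a match_phrase on the stemmed field alone cannot tell them apart. This sketch therefore assumes a hypothetical sub-field `frase.exact` indexed with the `standard` analyzer (no stemming), which is not part of the mapping described in this thread. The boost values are arbitrary placeholders.

```
# Assumption: "frase.exact" is a hypothetical sub-field using the "standard" analyzer.
# Assumption: boost values 10 and 3 are arbitrary and would need tuning.
GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "frase.exact": { "query": "film cortometraggio", "boost": 10 } } },
        { "match_phrase": { "frase.exact": { "query": "film cortometraggio", "slop": 5, "boost": 3 } } },
        { "match": { "frase": "film cortometraggio" } }
      ]
    }
  }
}
```

Boosts only bias the ranking; with BM25 a long stemmed match can still outscore a short exact one. If the tiers must never interleave, wrapping each clause in a `constant_score` query with well-separated boosts is one option, at the cost of losing relevance ranking inside each tier.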

If you don't succeed, please provide a full recreation script as described in About the Elasticsearch category. It will help to better understand what you are doing. Please, try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste in Kibana dev console, click on the run button to reproduce your use case. It will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.
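
For illustration, a reproduction script for this thread might look roughly like the sketch below. It uses the two phrases quoted in the question; the index name `research_repro` is a hypothetical throwaway name, the typeless mapping assumes Elasticsearch 7 or later, and the plain `match` query is an assumption about how the original search was run.

```
# Assumption: "research_repro" is a hypothetical throwaway index; do not run DELETE on a real index.
DELETE /research_repro

# Assumption: typeless mapping syntax (Elasticsearch 7 or later).
PUT /research_repro
{
  "mappings": {
    "properties": {
      "frase": { "type": "text", "analyzer": "italian" }
    }
  }
}

PUT /research_repro/_doc/1?refresh=true
{ "frase": "Pappi Corsicato Ha diretto film , cortometraggi , documentari e videoclip." }

PUT /research_repro/_doc/2?refresh=true
{ "frase": "Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;" }

# Assumption: the original search was a plain match query on "frase".
GET /research_repro/_search
{
  "query": { "match": { "frase": "film cortometraggio" } }
}

# Shows which tokens the field's analyzer produces for the singular and plural forms.
GET /research_repro/_analyze
{
  "field": "frase",
  "text": "cortometraggi cortometraggio"
}
```

Keeping the script to one mapping, two documents, and one query makes it easy for readers to see the ordering problem and try alternative queries against the same data.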

I did not succeed. The problem is that this script works fine if I use the boolean similarity algorithm:

GET /research/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "frase": {
              "query": "film cortometraggio",
              "slop": 5
            }
          }
        },
        {
          "match_phrase": {
            "frase": "film cortometraggio"
          }
        }
      ]
    }
  }
}

But it only works in this particular case and not in many others.
For example, if I search for "bobina del film" I'd like to retrieve:

  • An exact match like:
    La bobina contenente il film incompiuto e mai uscito nelle sale venne distrutta durante
    un bombardamento

  • A phrase that is not an exact match but matches the input text like:
    Il film,un cortometraggio in una bobina, fu distribuito dalla General Film Company e uscì in sala

  • A phrase that matches with words resulting from stemming like:
    Il 30 maggio 2013 James Bobin viene scelto come regista del film, il cui titolo di lavorazione è "Alice"

These three cases must appear in this order, but I can't achieve it. I'm trying the should/match_phrase combination, but I don't know how to specify these three different cases.

Because the boolean similarity algorithm does not work well for me, I went back to BM25, but I still have similar problems.

Please invest some time in a full script. As I said:

A full reproduction script is something anyone can copy and paste in Kibana dev console, click on the run button to reproduce your use case.
