Question about backward matching query

Hi,

Indexed data in ES

id(text) | words(keyword) 
A | "1 2 3 4 100 101 102 103 104 105"
B | "1 2 3 100 101 102 103 104 105"
C | "1 2 100"

Search query

{
  "query": {
    "bool": {
      "should": [
        { "term": { "words": "1" } },
        { "term": { "words": "2" } },
        { "term": { "words": "3" } },
        { "term": { "words": "4" } }
      ]
    }
  },
  "_source": ["id"],
  "size": 1,
  "sort": [
    { "_score": { "order": "desc" } }
  ]
}

Forward/Backward matching

Source words: 15
Target words: 10
Matches: 5
Forward Matching Rate: 33% (5/15)
Backward Matching Rate: 50% (5/10)

The search query returns the id with the highest forward matching rate. The result here is "id:A", because A contains all of 1, 2, 3 and 4, so its forward matching rate is 4/4. Its backward matching rate, however, is only 4/10.

What I am trying to do is get the id with the highest backward matching rate. In other words, I am looking for "id:C", whose backward matching rate of 2/3 is the highest of the three documents.
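
To make the two rates concrete, here is a small Python sketch that computes them for the three documents above, outside of Elasticsearch. The function name `matching_rates` and the variable names are only illustrative, and duplicate words are ignored by treating each side as a set.

```python
def matching_rates(query_terms, doc_terms):
    """Return (forward_rate, backward_rate) for one document."""
    query_set = set(query_terms)
    doc_set = set(doc_terms)
    matches = len(query_set & doc_set)
    forward = matches / len(query_set)   # share of query terms found in the doc
    backward = matches / len(doc_set)    # share of doc terms found in the query
    return forward, backward


docs = {
    "A": "1 2 3 4 100 101 102 103 104 105".split(),
    "B": "1 2 3 100 101 102 103 104 105".split(),
    "C": "1 2 100".split(),
}
query = ["1", "2", "3", "4"]

rates = {doc_id: matching_rates(query, words) for doc_id, words in docs.items()}
best_forward = max(rates, key=lambda d: rates[d][0])
best_backward = max(rates, key=lambda d: rates[d][1])
print(best_forward, best_backward)
```

Ranking by the first value picks A, which is what the query above does; ranking by the second value picks C, which is what I am after.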

Does anyone know how to do it?

It's a tricky ask since inverted indices don't quite work in this manner, so it's not easy to express with a simple query.

The simplest way to do something like this is to index another field that stores the number of words in the first field, for example length_words. You can then use it to adjust the score: you still check for matches in words, but you divide by the length, so a document where length_words is 10 and only 4 of your terms match scores lower than a short document where most of its words match.
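
A rough sketch of that idea, written as a Python dict so it can be sent with any client. It assumes several things that are not in the original post: that `length_words` is an integer field populated at index time on every document, that each word in `words` is indexed as its own term (for example as an array of keywords), and that you are on a version with the `script_score` query (7.0 or later). The helper name `backward_match_query` is made up for this example. Each term is wrapped in `constant_score` so that `_score` should equal the number of matching terms before the division.

```python
import json


def backward_match_query(terms):
    """Build a request body that scores docs by matches / length_words."""
    # Each matching term contributes exactly 1.0, so the bool query's score
    # is intended to be the number of query terms found in the document.
    should = [
        {"constant_score": {"filter": {"term": {"words": term}}, "boost": 1.0}}
        for term in terms
    ]
    return {
        "query": {
            "script_score": {
                "query": {"bool": {"should": should, "minimum_should_match": 1}},
                # Assumption: length_words exists and is > 0 on every document.
                "script": {"source": "_score / doc['length_words'].value"},
            }
        },
        "_source": ["id"],
        "size": 1,
    }


body = backward_match_query(["1", "2", "3", "4"])
print(json.dumps(body, indent=2))
```

With the sample data this should rank C first (2/3), ahead of A (4/10) and B (3/9), but treat it as a starting point and check the scores with `explain` on your own index.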

It's not perfect (doesn't help with duplicates), and it's also finicky to get working right.

You could probably cobble something together with a shingle token filter too, and look for exact shingle matches. Or use ngrams.
