Taking repeated (duplicate) words into account

How can I take the number of occurrences of a term in the search query into account?

Index:

{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

Docs:

ID 1: {"text": "foo"}
ID 2: {"text": "test foo repeat foo"}

Request:

{
  "size": 5,
  "query": {
    "multi_match": {
      "fields": [
        "text"
      ],
      "query": "foo foo",
      "analyzer": "whitespace",
      "minimum_should_match": "100%",
      "operator": "and"
    }
  },
  "explain": true
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.48326197,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.48326197,
        "_source": {
          "text": "foo"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.4289919,
        "_source": {
          "text": "test foo repeat foo"
        }
      }
    ]
  }
}

The word "foo" occurs in both documents.
In the request it occurs twice, so document ID 2 should get the higher _score, since the word is also repeated twice in it.
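
For anyone wondering where the two scores come from, here is a rough Python sketch that reproduces them. It is not Elasticsearch code. It assumes the default BM25 similarity with k1 = 1.2 and b = 0.75, a single shard containing only these two documents, and that "foo foo" is turned into two identical term clauses whose scores are summed. The function names are made up for illustration.

```python
import math

# Assumed defaults of the BM25 similarity (not set anywhere in the post).
K1 = 1.2
B = 0.75

def bm25_term_score(freq, doc_len, avg_len, doc_count, docs_with_term):
    """BM25 score of one term clause for one document."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf = freq / (freq + K1 * (1 - B + B * doc_len / avg_len))
    # The (K1 + 1) factor is an assumption; it is what makes the numbers
    # line up with the scores shown in the result above.
    return (K1 + 1) * idf * tf

docs = {"1": "foo", "2": "test foo repeat foo"}
query = "foo foo"

# Whitespace tokenizer + lowercase, as in my_analyzer (asciifolding is skipped here).
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)

def score(doc_id):
    """Sum the per-clause scores; the repeated query word is simply scored twice."""
    total = 0.0
    for term in query.lower().split():
        docs_with_term = sum(1 for t in tokens.values() if term in t)
        freq = tokens[doc_id].count(term)
        total += bm25_term_score(freq, len(tokens[doc_id]), avg_len,
                                 len(tokens), docs_with_term)
    return total

for doc_id in docs:
    print(doc_id, round(score(doc_id), 6))
```

Under these assumptions, document 2 does get credit for the second "foo" (freq = 2), but the length normalization for its four tokens outweighs that, so the one-word document 1 still comes out on top.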

The example above is simplified. In practice I use similarity = boolean, an ngram analyzer, fuzzy matching and more complex queries. But the situation is the same: Elasticsearch does not handle repeated (duplicate) words. If a term is present in the document, it is counted as a match for every occurrence of that word in the search query.

Ideally, a term that has already produced a match would be disabled, so that subsequent search words cannot match the same term again.
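
To make the desired behaviour concrete, here is a small Python sketch of the "each document term can be used only once" idea. It is only an illustration of the matching rule, computed outside of Elasticsearch. The function name consumed_match_count is hypothetical, and the sketch uses plain whitespace tokens with lowercasing, ignoring asciifolding, ngrams and fuzzy matching.

```python
from collections import Counter

def consumed_match_count(query, text):
    """Count query words that can be paired with a distinct document token.

    Each occurrence of a token in the document can satisfy at most one
    occurrence of the same word in the query.
    """
    query_counts = Counter(query.lower().split())
    doc_counts = Counter(text.lower().split())
    # A missing key in a Counter yields 0, so unmatched words add nothing.
    return sum(min(count, doc_counts[term]) for term, count in query_counts.items())

print(consumed_match_count("foo foo", "foo"))                  # 1
print(consumed_match_count("foo foo", "test foo repeat foo"))  # 2
```

With this rule, document 2 satisfies both words of "foo foo" while document 1 satisfies only one, which is the ordering I would like to get.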

The problem could be solved with "Scripted similarity", but that is forbidden!

{
  "settings": {
    "similarity": {
      "search_similarity" : {
        "type": "scripted",
        "script": {
          "source": "return query.boost / doc.freq;"
        }
      }
    }
..................................
