Keyword Array Field Removes Duplicates in Queries

aeroplane380 · February 6, 2024, 1:27am

This post relates to a problem I have encountered in my business production database, but will be described with a minimal reproducible example.

I have an index with the following mapping:

{
  "properties": {
    "test_field": {
      "type": "keyword"
    }
  }
}

I have two documents in the index:

{"test_field": ["good", "good", "good", "good"]}
{"test_field": ["good", "good"]}

I am trying to perform a search on terms in this keyword field. When I search for good, I want the first document to have a higher score because there are more matching terms.

As far as I understand, norms are disabled by default for keyword fields, so field length shouldn't affect the scoring. When I perform a multi_match query on this field (there are more fields in the query in my production database), both of these documents receive the same score. As the following excerpt from the explain output shows, freq is computed as 1.

{
  "value": 0.45454544,
  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
  "details": [{
    "value": 1,
    "description": "freq, occurrences of term within document",
    "details": []
  },
  ...
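
For reference, the tf value in this excerpt can be reproduced by hand. The sketch below assumes the BM25 defaults k1 = 1.2 and b = 0.75 (the truncated excerpt does not show them) and a length ratio dl / avgdl of 1, which is consistent with the reported value when norms are disabled. The function name bm25_tf is made up for illustration.

```python
def bm25_tf(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    """tf term from the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl)).

    k1, b, dl and avgdl are assumed values, not read from the index.
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# freq = 1 is what the explain output reports for both documents.
tf_reported = bm25_tf(1)
# freq = 4 is what the first document would get if duplicates were counted.
tf_if_counted = bm25_tf(4)

print(round(tf_reported, 4))    # 0.4545
print(round(tf_if_counted, 4))  # 0.7692
```

With freq fixed at 1, both documents get the same tf no matter how many times the term is repeated in the array.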

After investigation, it appears that the query thinks there is only one element in the field, as shown by this test query:

{
  "query": {
    "bool": {
      "must": [{
        "script_score": {
          "query": {
            "multi_match": {
              "query": "word",
              "fields": ["test_field"]
            }
          },
          "script": {
            "source": "return doc.test_field.size();"
          }
        }
      }]
    }
  },
  "explain": true
}

which returns the following:

{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "EuLve40BRTvtv3Tg10Vy",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word",
            "word",
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      },
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "E-Lve40BRTvtv3Tg5kVi",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      }
    ]
  }
}

I see that the stored document still contains the duplicate strings in the field, so I am guessing that this is something that happens at index time?
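
A rough model of why both numbers come out as 1, under two assumptions: keyword doc values are kept as a sorted set of unique values (which is what doc.test_field reads in a script), and keyword fields default to index_options of docs, which records only that a term occurs in a document, not how often. This is plain Python that imitates the behaviour, not Elasticsearch code, and both function names are hypothetical.

```python
from collections import Counter


def sorted_set_doc_values(values):
    """Model of a sorted-set doc values column: unique values, sorted."""
    return sorted(set(values))


def postings_freq(values, term, index_options="docs"):
    """Model of the term frequency visible to the scorer for one document."""
    count = Counter(values)[term]
    if count == 0:
        return 0
    # "docs" keeps only presence; "freqs" also keeps the occurrence count.
    return count if index_options == "freqs" else 1


source = ["word", "word", "word", "word", "word"]

# _source is returned as stored, so the duplicates are still visible there.
print(len(source))                                  # 5
# The script sees the deduplicated doc values instead.
print(len(sorted_set_doc_values(source)))           # 1
# Default keyword indexing reports freq = 1 for a matching term.
print(postings_freq(source, "word"))                # 1
# Keeping frequencies would expose the count in this model.
print(postings_freq(source, "word", "freqs"))       # 5
```

The keyword mapping documents an index_options parameter that can be set to freqs instead of the default docs. Whether that alone produces the scoring wanted here has not been verified on this index.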

Is this the correct behaviour, and if so, is there a way I can perform the query I need with the scoring I describe above?

RabBit_BR · February 6, 2024, 6:46am

Hi @aeroplane380

I don't know your requirements, but you could use runtime_mappings to retrieve the size of the array and a function_score query to boost the docs with the most elements in the array.

{
  "fields": [
    "size_array"
  ],
  "runtime_mappings": {
    "size_array": {
      "type": "long",
      "script": {
        "source": "emit(params['_source'].test_field.length)"
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "good",
          "fields": [
            "test_field"
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "size_array",
            "factor": 1
          }
        }
      ]
    }
  }
}

aeroplane380 · February 6, 2024, 9:52am

Thanks for the reply.

I actually need to score based on the number of occurrences of my search term, not on the number of elements in the array in total.
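
To make the difference concrete, here is a small sketch of the ranking being asked for, written as plain Python over _source-like dicts rather than as an Elasticsearch query. The occurrences helper and the third document are hypothetical, added only to show a case where array length and term count disagree.

```python
def occurrences(doc, field, term):
    """Count how many elements of doc[field] equal the search term."""
    values = doc.get(field, [])
    if isinstance(values, str):
        values = [values]  # a single value behaves like a one-element array
    return sum(1 for value in values if value == term)


docs = [
    {"test_field": ["good", "good", "good", "good"]},
    {"test_field": ["good", "good"]},
    # Hypothetical document: longest array, but only one match.
    {"test_field": ["good", "bad", "bad", "bad", "bad"]},
]

# Rank by occurrences of the term, not by the total number of elements.
ranked = sorted(
    docs,
    key=lambda doc: occurrences(doc, "test_field", "good"),
    reverse=True,
)

for doc in ranked:
    print(occurrences(doc, "test_field", "good"), len(doc["test_field"]))
```

Scoring by array size would put the hypothetical third document first, while scoring by occurrences puts it last.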
