Keyword Array Field Removes Duplicates in Queries

This post relates to a problem I encountered in my production database, but I will describe it with a minimal reproducible example.

I have an index with the following mapping:

{
  "properties": {
    "test_field": {
      "type": "keyword"
    }
  }
}

I have two documents in the index:

{"test_field": ["good", "good", "good", "good"]}
{"test_field": ["good", "good"]}

I am trying to perform a search on terms in this keyword field. When I search for good, I want the first document to have a higher score because there are more matching terms.

As far as I understand, norms are disabled by default for keyword fields so this shouldn't affect the scoring. When I perform a multi_match query on this field (there are more fields in the query in my production database), both of these documents receive the same score. As the following excerpt shows, freq is computed as 1.

{
  "value": 0.45454544,
  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
  "details": [{
    "value": 1,
    "description": "freq, occurrences of term within document",
    "details": []
  },
  ...
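As a sanity check on that number, here is a small Python sketch of the tf formula quoted in the explain output. It assumes the BM25 defaults k1 = 1.2 and b = 0.75, and that dl equals avgdl; those values are not shown in the excerpt, but they reproduce the reported 0.45454544 when freq is 1. The function name is only illustrative.

```python
def bm25_tf(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    """tf component as written in the explain output.

    k1, b, dl and avgdl are assumed values, not taken from the excerpt.
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


if __name__ == "__main__":
    # freq = 1 is what Elasticsearch reported; 2 and 4 are the
    # frequencies I expected for my two example documents.
    for freq in (1, 2, 4):
        print(freq, round(bm25_tf(freq), 4))
```

If the real frequencies were used, the document with four occurrences would get a larger tf value than the one with two, which is the ordering I am after.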

After investigation, it appears that the query sees only one element in the field, as shown by this test query (run against a similar pair of documents that contain "word" instead of "good"):

{
  "query": {
    "bool": {
      "must": [{
        "script_score": {
          "query": {
            "multi_match": {
              "query": "word",
              "fields": ["test_field"]
            }
          },
          "script": {
            "source": "return doc.test_field.size();"
          }
        }
      }]
    }
  },
  "explain": true
}

which returns the following:

{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "EuLve40BRTvtv3Tg10Vy",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word",
            "word",
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      },
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "E-Lve40BRTvtv3Tg5kVi",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      }
    ]
  }
}

I see that the stored document still contains the duplicate strings in the field, so I am guessing that this is something that happens at index time?
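To make the difference concrete, here is a rough Python model of what I think is going on. It assumes that the doc values of a keyword field behave like a sorted set of unique values, while _source keeps the original array untouched. The function name is hypothetical and this is only a simulation, not Elasticsearch code.

```python
def doc_values_view(source_values):
    """Model of keyword doc values: unique values in sorted order (assumption)."""
    return sorted(set(source_values))


source_a = ["good", "good", "good", "good"]
source_b = ["good", "good"]

# _source still has every element, duplicates included.
print(len(source_a), len(source_b))

# The doc-values view collapses both documents to a single value,
# which would explain why doc.test_field.size() is the same for both.
print(len(doc_values_view(source_a)), len(doc_values_view(source_b)))
```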

Is this the correct behaviour, and if so, is there a way I can perform the query I need with the scoring I describe above?

Hi @aeroplane380

I don't know your requirements, but you could use runtime_mappings to retrieve the size of the array and a function_score query to give a higher score to the docs with more elements in the array.

{
  "fields": [
    "size_array"
  ],
  "runtime_mappings": {
    "size_array": {
      "type": "long",
      "script": {
        "source": "emit(params['_source'].test_field.length)"
      }
    }
  },
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "good",
          "fields": [
            "test_field"
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "size_array",
            "factor": 1
          }
        }
      ]
    }
  }
}

Thanks for the reply.

I actually need to score based on the number of occurrences of my search term, not on the number of elements in the array in total.
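One direction that might work is to count the matching elements in _source inside a script_score script. Below is a minimal Python sketch that builds such a request body. It assumes that params._source is readable from a score script in your Elasticsearch version and that the field always holds an array; the helper names are hypothetical and the Painless script has not been run against a live cluster. Reading _source for every hit is also likely to be slow on large result sets.

```python
import json

# Painless script (assumption: params._source is available in score scripts).
# It walks the original array and counts elements equal to the search term.
PAINLESS_COUNT = (
    "int n = 0; "
    "for (def v : params._source[params.field]) { "
    "if (v == params.term) { n++; } } "
    "return n;"
)


def build_query(term, field="test_field"):
    """Build a script_score body that scores by occurrences of `term`."""
    return {
        "query": {
            "script_score": {
                "query": {"multi_match": {"query": term, "fields": [field]}},
                "script": {
                    "source": PAINLESS_COUNT,
                    "params": {"field": field, "term": term},
                },
            }
        }
    }


def count_occurrences(source_values, term):
    """Python mirror of the Painless loop, to check the intended scores."""
    return sum(1 for v in source_values if v == term)


if __name__ == "__main__":
    print(json.dumps(build_query("good"), indent=2))
    print(count_occurrences(["good", "good", "good", "good"], "good"))
    print(count_occurrences(["good", "good"], "good"))
```

With the two example documents, the intended scores would be 4 and 2, so the first document ranks higher.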
