Undercounting of doc_counts in Terms agg V.S. scripted terms agg

gpiimetrics · March 30, 2020, 5:26pm

I was experimenting a scripted terms agg that converts a field to String, because we have all data types at different times of the project (I know it's crazy but have to deal with it now).

Then I noticed a mismatch in bucket "doc_count" between:

straightforward terms agg
scripted terms agg where the script returns [any value].toString() -- which I suppose does the same thing as 1.

Sample code below:

GET gpii-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "json.timestamp": {
              "gte": "2020-01-01",
              "lte": "2020-02-29"
            }
          }
        },
        {
          "term": {
            "json.event.keyword": {
              "value": "setting-changed"
            }
          }
        },
        {
          "term": {
            "json.data.path.keyword": {
              "value": """http://registry\.gpii\.net/common/selfVoicing/enabled"""
            }
          }
        },
        {
          "exists": {
            "field": "json.data.value"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_term": {
      "terms": {
        "field": "json.data.value.keyword",
        "size": 10000
      }
    },
    "by_string": {
      "terms": {
        "script": """
              return params._source.json.data.value?.toString() ?: "";
              """,
        "size": 10000
      }
    }
  }
}

However, the doc_counts in (1) "by_term" are always incorrect and doc_counts in (2) "by_string" are always correct.

Sample results are as follows:

{
  "took" : 4221,
  "timed_out" : false,
  "_shards" : {
    "total" : 2735,
    "successful" : 2735,
    "skipped" : 2401,
    "failed" : 0
  },
  "hits" : {
    "total" : 22,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "by_term" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "true",
          "doc_count" : 10
        },
        {
          "key" : "false",
          "doc_count" : 5
        }
      ]
    },
    "by_string" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "true",
          "doc_count" : 15
        },
        {
          "key" : "false",
          "doc_count" : 7
        }
      ]
    }
  }
}

The mismatch happens when the original data type is boolean (as in the example above), and also when the original type is integer, double, and string.

What might cause the undercounting in terms agg? Please advise.

system · April 27, 2020, 5:26pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.