Counting each key of an array inside aggregation

I am trying to build an aggregation that counts the number of times each specific value of an array field is repeated. As an example:

curl -X PUT "localhost:9200/products" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "words": {
        "type": "keyword"
      }
    }
  }
}'
curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"words":["room","kitchen","room"]}
{"index":{"_id":2}}
{"words":["room","restroom"]}
'

I wanted to get a count of:
room: 3
kitchen: 1
restroom: 1
From what I gathered from similar questions, value_count should help me with that, but what I get is:

curl -X POST "localhost:9200/products/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "words": {
      "terms": {
        "field": "words"
      },
      "aggs": {
        "total": {
          "value_count": {
            "field": "words"
          }
        }
      }
    }
  }
}
'
{
  "aggregations" : {
    "words" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "room",
          "doc_count" : 2,
          "total" : {
            "value" : 4
          }
        },
        {
          "key" : "kitchen",
          "doc_count" : 1,
          "total" : {
            "value" : 2
          }
        },
        {
          "key" : "restroom",
          "doc_count" : 1,
          "total" : {
            "value" : 2
          }
        }
      ]
    }
  }
}

As you can see, it's counting the values from the whole array of every document where the key is present. Is there a way to count only the number of times each value appears in the arrays?

I realized the value_count aggregation works as the sum of the number of unique values per document over the documents in each bucket. In your example the room bucket contains both documents, and each of them contributes its 2 de-duplicated values, hence the total of 4.

One possibility is using a scripted metric aggregation. The values in doc[field] are de-duplicated, so I used params._source instead. This aggregation could be slow.

GET /test_products/_search
{
  "size": 0,
  "aggs": {
    "value_count": {
      "scripted_metric": {
        "init_script": "state.map = new HashMap()",
        "map_script": """
          // count every occurrence by reading the raw array from _source
          def val = params._source[params.field];
          if (val instanceof List) {
            for (v in val) {
              state.map[v] = state.map.getOrDefault(v, 0) + 1;
            }
          } else {
            state.map[val] = state.map.getOrDefault(val, 0) + 1;
          }
        """,
        "combine_script": "return state.map",
        "reduce_script": """
          // merge the per-shard maps into one word -> count map
          Map m = new HashMap();
          for (map in states) { map.forEach((k,v) -> m[k] = m.getOrDefault(k,0) + v); }
          return m;
        """,
        "params": {
          "field": "words"
        }
      }
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "value_count" : {
      "value" : {
        "kitchen" : 1,
        "room" : 3,
        "restroom" : 1
      }
    }
  }
}

That works pretty much like I wanted, but it takes 15x more time to calculate. So I am trying to find a solution by reorganizing the field into a hash of counts. Still struggling with the aggregation...
Thanks!

Another way is using an ingest pipeline to count the terms at index time, in exchange for a heavier indexing load. Below is my sample script, just for practice.

With this strategy you have to specify every word whose count you want, one sum aggregation per word (see the example after the pipeline below).

PUT _ingest/pipeline/test_word_count
{
  "processors": [
    {
      "script": {
        "source": """
          // build a map of value -> number of occurrences in the source array
          Map map = new HashMap();
          if (ctx[params.field] instanceof List) {
            for (val in ctx[params.field]) {
              map[val] = map.getOrDefault(val, 0) + 1;
            }
          } else {
            map[ctx[params.field]] = 1;
          }
          ctx[params.target] = map;
        """,
        "params": {
          "field": "words",
          "target": "words_count"
        }
      }
    }
  ]
}
  
PUT /test_products/_settings
{
  "index.default_pipeline": "test_word_count"
}

POST test_products/_update_by_query

GET test_products/_search
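
With the documents re-indexed through the pipeline, each one should carry a words_count object (for example, document 1 should end up with "words_count": {"room": 2, "kitchen": 1}), and dynamic mapping should index those counts as numeric fields. Getting the totals is then one sum aggregation per word; this is only a sketch, and the aggregation names below are made up:

GET test_products/_search
{
  "size": 0,
  "aggs": {
    "room_total": { "sum": { "field": "words_count.room" } },
    "kitchen_total": { "sum": { "field": "words_count.kitchen" } },
    "restroom_total": { "sum": { "field": "words_count.restroom" } }
  }
}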

Generating this map is not a problem for me, as I can change the structure of the field to whatever I want. The problem is that a scripted query running over a very big dataset can be too expensive.
I could also create this field with the information structured like:

{ "room": 2, "restroom": 1 }

or

{"topic": "room", "count": 2}, {"topic": "restroom", "count": 1 }

Thanks for helping!
