Exclude array items in cardinality aggregation

My documents look like this:

    [
      {
        "type": 1,
        "document_id": [4, 5]
      },
      {
        "type": 1,
        "document_id": [4]
      },
      {
        "type": 2,
        "document_id": [5]
      },
      {
        "type": 2,
        "document_id": [4, 5]
      }
    ]

I am trying to get the unique document ID count for type 1 and for type 2. The tricky part is that I don't want to count a document ID again in type 2 if it has already been counted in type 1.

For example, using the cardinality aggregation:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "type": 1
          }
        }
      ]
    }
  },
  "aggs": {
    "document_count": {
      "cardinality": {
        "field": "document_id"
      }
    }
  }
}

With this query I get 2 unique document IDs for type 1. If I run the same query for type 2, I also get 2.
The result I want is 2 for type 1 and 0 for type 2, because document IDs 4 and 5 have already been counted in type 1 and should be excluded from type 2.
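To pin down the expected semantics, here is a minimal Python sketch of the counting rule described above, run against the four sample documents. The function name `exclusive_counts` and the `type_order` parameter are hypothetical and used only for illustration. The sketch runs outside Elasticsearch and only states the desired result.

```python
# Sample documents copied from the question.
DOCS = [
    {"type": 1, "document_id": [4, 5]},
    {"type": 1, "document_id": [4]},
    {"type": 2, "document_id": [5]},
    {"type": 2, "document_id": [4, 5]},
]


def exclusive_counts(docs, type_order):
    """Count unique IDs per type, skipping IDs already seen in an earlier type."""
    seen = set()      # IDs counted for any earlier type
    counts = {}
    for t in type_order:
        # All unique IDs that appear in documents of this type.
        ids = {i for d in docs if d["type"] == t for i in d["document_id"]}
        counts[t] = len(ids - seen)   # only IDs not counted before
        seen |= ids
    return counts


print(exclusive_counts(DOCS, [1, 2]))  # {1: 2, 2: 0}
```

The order of `type_order` matters: whichever type comes first "claims" the shared IDs.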

Does anyone know if this is doable please?
Thanks!

I've tried to solve the issue with a scripted_metric aggregation.

Demo data

PUT testd/_doc/1
{
  "type": 1,
  "document_id": [
    4,
    5
  ]
}
PUT testd/_doc/2
{
  "type": 1,
  "document_id": [
    4
  ]
}
PUT testd/_doc/3
{
  "type": 2,
  "document_id": [
    5
  ]
}
PUT testd/_doc/4
{
  "type": 2,
  "document_id": [
    4,
    5
  ]
}

Aggregation

This only works on Elasticsearch 7.7 and later.

GET testd/_search
{
  "aggs": {
    "NAME": {
      "scripted_metric": {
        "init_script": "state.types = new HashMap();",
        "map_script": "def t = doc['type'].value.toString(); if (!state.types.containsKey(t)) { state.types[t] = new HashSet(); }\nstate.types[t].addAll(doc['document_id']);",
        "combine_script": "return state;",
        "reduce_script": "def type1 = new HashSet(); def type2 = new HashSet(); for (s in states) { if (s == null) { continue; } if (s.types['1'] != null) { type1.addAll(s.types['1']); } if (s.types['2'] != null) { type2.addAll(s.types['2']); } } type2.removeAll(type1); return [ '1': type1.size(), '2': type2.size() ]"
      }
    }
  }
}

Alternative for versions before 7.7 (slightly less efficient):

GET testd/_search
{
  "aggs": {
    "NAME": {
      "scripted_metric": {
        "init_script": "state.types = new HashMap();",
        "map_script": "def t = doc['type'].value.toString(); if (!state.types.containsKey(t)) { state.types[t] = new HashMap(); }\n for(d in doc['document_id']) { state.types[t][d] = true; }",
        "combine_script": "return state;",
        "reduce_script": "def type1 = new HashSet(); def type2 = new HashSet(); for (s in states) { if (s == null) { continue; } if (s.types['1'] != null) { type1.addAll(s.types['1'].keySet()); } if (s.types['2'] != null) { type2.addAll(s.types['2'].keySet()); } } type2.removeAll(type1); return [ '1': type1.size(), '2': type2.size() ]"
      }
    }
  }
}

Result:

...
  "aggregations" : {
    "NAME" : {
      "value" : {
        "1" : 2,
        "2" : 0
      }
    }
  }
}
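For anyone who wants to follow how the scripted_metric phases fit together, here is a rough Python simulation of the same logic. It is only a sketch: the function names `map_phase` and `reduce_phase` and the split of the demo data into two shards are assumptions made for illustration, and real shard allocation will differ. Each shard builds a map from type to a set of document IDs, and the reduce step merges the per-shard sets before removing the type 1 IDs from type 2.

```python
from collections import defaultdict


def map_phase(shard_docs):
    """Mimics init_script + map_script: collect a set of IDs per type on one shard."""
    types = defaultdict(set)
    for doc in shard_docs:
        types[str(doc["type"])].update(doc["document_id"])
    # combine_script simply returns the shard state unchanged.
    return {"types": dict(types)}


def reduce_phase(states):
    """Mimics reduce_script: merge shard states, then drop type 1 IDs from type 2."""
    type1, type2 = set(), set()
    for s in states:
        # .get(..., set()) lets a shard without documents of a type contribute nothing.
        type1 |= s["types"].get("1", set())
        type2 |= s["types"].get("2", set())
    type2 -= type1
    return {"1": len(type1), "2": len(type2)}


# Hypothetical distribution of the four demo documents over two shards.
shard_a = [{"type": 1, "document_id": [4, 5]}, {"type": 2, "document_id": [5]}]
shard_b = [{"type": 1, "document_id": [4]}, {"type": 2, "document_id": [4, 5]}]

result = reduce_phase([map_phase(shard_a), map_phase(shard_b)])
print(result)  # {'1': 2, '2': 0}
```

Note that the full ID sets travel from every shard to the coordinating node, so memory use grows with the number of distinct IDs, and the count is exact rather than approximate as with the cardinality aggregation.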

Thanks a lot @Luca_Belluccini! That is exactly what I need.
