Terms aggregation on more than one term

yehua984710 · April 13, 2016, 10:54am

I know that a terms aggregation returns buckets showing how many docs contain each individual term.
Recently I got a task to aggregate over a dataset to get the distribution of combined terms.

For example, I have the docs below:

1. { tags: ["elasticsearch", "mongodb"] }
2. { tags: ["redis", "mongodb", "spark"] }
3. { tags: ["hadoop", "elasticsearch", "mongodb"] }
4. { tags: ["redis", "spark"] }

abdon · April 15, 2016, 10:07am

The easiest way to do this would be to store all combinations of tags for each document in a separate field at index time, and then aggregate on that combination field instead of on the tags themselves.
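
As a rough sketch of the index-time approach, the snippet below builds the pair combinations in Python before the documents are sent to Elasticsearch. The field name `tag_pairs` and the helper `with_tag_pairs` are hypothetical names picked for illustration, and it assumes that pairs (rather than larger combinations) are what you need. The combination field would presumably have to be mapped as a not-analyzed string so that each pair stays a single term.

```python
from itertools import combinations


def with_tag_pairs(doc):
    """Return a copy of doc with a hypothetical 'tag_pairs' field added."""
    # Sort (and de-duplicate) first so a given pair always produces the
    # same key, whatever order the tags were listed in.
    tags = sorted(set(doc["tags"]))
    # Join every 2-tag combination into a single comma-separated term.
    pairs = [",".join(pair) for pair in combinations(tags, 2)]
    return dict(doc, tag_pairs=pairs)


docs = [
    {"tags": ["elasticsearch", "mongodb"]},
    {"tags": ["redis", "mongodb", "spark"]},
    {"tags": ["hadoop", "elasticsearch", "mongodb"]},
    {"tags": ["redis", "spark"]},
]

# These enriched documents are what would be indexed.
enriched = [with_tag_pairs(d) for d in docs]
```

A plain terms aggregation on the combination field should then give the distribution without any scripting.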

If you would rather do it with aggregations alone, one way would be a scripted terms aggregation. The script builds an array of tag combinations from the values of each document's tags field. The aggregation then runs on these combination values rather than on the original tag values.

So, assuming the data has been entered like this:

POST /test-aggs/test/_bulk
{ "index": { "_id": 1 }}
{ "tags" :["elasticsearch", "mongodb"] }
{ "index": { "_id": 2 }}
{ "tags" :["redis", "mongodb", "spark"]}
{ "index": { "_id": 3 }}
{ "tags" :["hadoop", "elasticsearch", "mongodb"]}
{ "index": { "_id": 4 }}
{ "tags" :["redis", "spark"]}

You could do a scripted terms aggregation like this:

GET /test-aggs/_search
{
  "size": 0,
  "aggs": {
    "tags": {
      "terms": {
        "script" : "def list = []; def temp = doc['tags'].values.sort(false); for (i = 0; i < temp.size() - 1; i++) { for (j = i + 1; j < temp.size(); j++) { list.add(temp[i] + ',' + temp[j]) } }; return list;"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "_uid"
          }
        }
      }
    }
  }
}

This would return:

{
  "took": 48,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "elasticsearch,mongodb",
          "doc_count": 2,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#1",
                "doc_count": 1
              },
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "redis,spark",
          "doc_count": 2,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              },
              {
                "key": "test#4",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "elasticsearch,hadoop",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "hadoop,mongodb",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "mongodb,redis",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "mongodb,spark",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

Note that I sort the tag values first, to prevent the combinations “elasticsearch,mongodb” and “mongodb,elasticsearch” from ending up in separate buckets. Also, I aggregate on _uid rather than _id to get the document ID, because _id is not indexed by default.
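
To see why the sort matters, here is a small Python emulation of the pairing logic, run outside Elasticsearch. It is only a sketch of what the script computes, not the script itself: `pair_counts` is a hypothetical helper, and document 5 is an invented document added purely to show the split.

```python
from collections import Counter
from itertools import combinations

docs = {
    1: ["elasticsearch", "mongodb"],
    2: ["redis", "mongodb", "spark"],
    3: ["hadoop", "elasticsearch", "mongodb"],
    4: ["redis", "spark"],
}


def pair_counts(docs, sort_tags=True):
    """Count how many documents contain each tag pair."""
    counts = Counter()
    for tags in docs.values():
        # Sorting makes the pair key independent of the tag order.
        ordered = sorted(tags) if sort_tags else tags
        for a, b in combinations(ordered, 2):
            counts[a + "," + b] += 1
    return counts


sorted_counts = pair_counts(docs)

# Hypothetical document 5 lists the same two tags as document 1, reversed.
reversed_docs = {1: ["elasticsearch", "mongodb"], 5: ["mongodb", "elasticsearch"]}
merged = pair_counts(reversed_docs, sort_tags=True)
split = pair_counts(reversed_docs, sort_tags=False)
```

With sorting, `merged` holds a single key, "elasticsearch,mongodb", counted twice. Without it, `split` holds two keys with one document each, which is exactly the separate-bucket problem described above.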

Also keep in mind that scripted aggregations will be slow on large datasets, and that inline scripting is turned off by default, so you may want to move the script to a file rather than run it inline.

yehua984710 · April 18, 2016, 5:36am

Hi abdon,

That is helpful. Thank you.
