Terms aggregation on more than one term

I know that a terms aggregation returns buckets showing how many documents each individual term appears in.
Recently I was given a task to aggregate over a dataset to get the distribution of combinations of terms.

For example, I have the documents below:

1. { tags: ["elasticsearch", "mongodb"] }
2. { tags: ["redis", "mongodb", "spark"] }
3. { tags: ["hadoop", "elasticsearch", "mongodb"] }
4. { tags: ["redis", "spark"] }

If I limit each combination to 2 terms, is there a way in Elasticsearch to get data like the following?
| combination | documents |
| elasticsearch,mongodb | 1,3 |
| elasticsearch,hadoop | 3 |
| mongodb,redis | 2 |
| mongodb,spark | 2 |
| mongodb,hadoop | 3 |
| redis,spark | 2,4 |

The easiest way to do this would be to store all combinations of tags for each document in a separate field at index time, and then aggregate on that combination field instead of on the tags themselves.
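As a rough sketch of that index-time approach, the combinations could be computed in the client before each document is sent. The field name `tag_combinations` and both helper functions are made up for illustration, and the field would most likely need to be mapped as a non-analyzed string so that each combination stays a single term.

```python
from itertools import combinations


def tag_combinations(tags, size=2):
    """Return every `size`-tag combination as a comma-joined string."""
    # Sort (and de-duplicate) first so "a,b" and "b,a" never both appear.
    unique_sorted = sorted(set(tags))
    return [",".join(combo) for combo in combinations(unique_sorted, size)]


def with_combinations(doc, size=2):
    """Copy `doc` and add the hypothetical `tag_combinations` field."""
    enriched = dict(doc)
    enriched["tag_combinations"] = tag_combinations(doc.get("tags", []), size)
    return enriched


# Request body for a plain (non-scripted) terms aggregation on that field.
# "tag_combinations" is an assumed field name, not something from the thread.
AGG_BODY = {
    "size": 0,
    "aggs": {"combinations": {"terms": {"field": "tag_combinations"}}},
}

if __name__ == "__main__":
    print(with_combinations({"tags": ["redis", "mongodb", "spark"]}))
```

The trade-off is extra index size and some client-side work in exchange for an ordinary field-based terms aggregation at query time.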

If you would rather compute the combinations at query time, one way of doing this would be through a scripted terms aggregation. The script would build an array of tag combinations out of the values in each document's tags field. The aggregation is then done on these combination values rather than on the original tag values.

So, assuming the data has been entered like this:

POST /test-aggs/test/_bulk
{ "index": { "_id": 1 }}
{ "tags" :["elasticsearch", "mongodb"] }
{ "index": { "_id": 2 }}
{ "tags" :["redis", "mongodb", "spark"]}
{ "index": { "_id": 3 }}
{ "tags" :["hadoop", "elasticsearch", "mongodb"]}
{ "index": { "_id": 4 }}
{ "tags" :["redis", "spark"]}

You could do a scripted terms aggregation like this:

GET /test-aggs/_search
{
  "size": 0,
  "aggs": {
    "tags": {
      "terms": {
        "script" : "def list = []; def temp = doc['tags'].values.sort(false); for (i = 0; i < temp.size() - 1; i++) { for (j = i + 1; j < temp.size(); j++) { list.add(temp[i] + ',' + temp[j]) } }; return list;"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "_uid"
          }
        }
      }
    }
  }
}

This would return:

{
  "took": 48,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "elasticsearch,mongodb",
          "doc_count": 2,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#1",
                "doc_count": 1
              },
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "redis,spark",
          "doc_count": 2,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              },
              {
                "key": "test#4",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "elasticsearch,hadoop",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "hadoop,mongodb",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#3",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "mongodb,redis",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "mongodb,spark",
          "doc_count": 1,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "test#2",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

Note that I sort the tag values first, to prevent the combinations “elasticsearch,mongodb” and “mongodb,elasticsearch” from ending up in separate buckets. Also, I aggregate on _uid rather than _id to get the document ID, as _id is not indexed by default.
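To check the pairing logic offline, here is a small Python sketch that mirrors the nested loops of the script on the four sample documents. It only imitates the bucketing and is not how Elasticsearch runs the aggregation; `pair_buckets` and its `sort_tags` flag are hypothetical names.

```python
from collections import defaultdict

# The four sample documents from the question, keyed by document ID.
DOCS = {
    1: ["elasticsearch", "mongodb"],
    2: ["redis", "mongodb", "spark"],
    3: ["hadoop", "elasticsearch", "mongodb"],
    4: ["redis", "spark"],
}


def pair_buckets(docs, sort_tags=True):
    """Group document IDs by every two-tag combination they contain."""
    buckets = defaultdict(list)
    for doc_id, tags in docs.items():
        # Sorting makes the key independent of the order the tags were stored in.
        values = sorted(tags) if sort_tags else list(tags)
        for i in range(len(values) - 1):
            for j in range(i + 1, len(values)):
                buckets[values[i] + "," + values[j]].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    for key, ids in sorted(pair_buckets(DOCS).items()):
        print(key, ids)
```

With made-up input such as `{1: ["a", "b"], 2: ["b", "a"]}`, calling `pair_buckets(..., sort_tags=False)` produces two separate keys, "a,b" and "b,a", while the sorted version puts both documents under "a,b".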

Also keep in mind that a scripted aggregation runs the script for every matching document, so it will be slow on large datasets. Inline scripting is also turned off by default, so you may want to move the script to a file rather than use an inline script.

Hi abdon,

That is helpful. Thank you.