Elasticsearch: find documents with distinct values and then aggregate over them


(Itamar Ravid) #1

(cross-posted at http://stackoverflow.com/questions/30649740/elasticsearch-find-documents-with-distinct-values-and-then-aggregate-over-them)

Hey everyone, question about aggregations.

My index has a log-like structure: I insert a version of a document whenever an event occurs. For example, here are documents in the index:

{ "key": "a", "subkey": 0 }
{ "key": "a", "subkey": 0 }
{ "key": "a", "subkey": 1 }
{ "key": "a", "subkey": 1 }
{ "key": "b", "subkey": 0 }
{ "key": "b", "subkey": 0 }
{ "key": "b", "subkey": 1 }
{ "key": "b", "subkey": 1 }

Each document has more fields that differentiate the versions, but they don't matter for this query. I'm trying to construct a query that is basically equivalent to the following SQL query:

SELECT COUNT(*), key, subkey
FROM (SELECT DISTINCT key, subkey FROM t) AS distinct_pairs
GROUP BY key, subkey

For the documents above, the result of this query would be:

(1, a, 0)
(1, a, 1)
(1, b, 0)
(1, b, 1)
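
To make the intended semantics concrete, here is a minimal Python sketch of what I expect from the query: deduplicate on (key, subkey) first, then count rows per pair. It runs client-side on the sample documents and does not use Elasticsearch at all; the function name `distinct_pair_counts` is made up for illustration.

```python
from collections import Counter

# Sample documents from above (the extra version fields are omitted).
docs = [
    {"key": "a", "subkey": 0},
    {"key": "a", "subkey": 0},
    {"key": "a", "subkey": 1},
    {"key": "a", "subkey": 1},
    {"key": "b", "subkey": 0},
    {"key": "b", "subkey": 0},
    {"key": "b", "subkey": 1},
    {"key": "b", "subkey": 1},
]


def distinct_pair_counts(documents):
    """Deduplicate on (key, subkey), then count rows per pair."""
    # Inner step: SELECT DISTINCT key, subkey
    distinct_pairs = {(d["key"], d["subkey"]) for d in documents}
    # Outer step: COUNT(*) per (key, subkey) over the deduplicated rows
    counts = Counter(distinct_pairs)
    return sorted((count, key, subkey) for (key, subkey), count in counts.items())


for row in distinct_pair_counts(docs):
    print(row)
```

Because the pairs are deduplicated before counting, every count is 1 regardless of how many versions of a document exist.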

How would I replicate this in Elasticsearch? I came up with the following:

GET test_index/test_type/_search?search_type=count
{
  "aggregations": {
    "count_aggr": {
      "terms": {
        "field": "concatenated_key"
      },
      "aggs": {
        "sample_doc": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

concatenated_key is a concatenation of key and subkey. This query creates a bucket for each (key, subkey) combination and returns one sample document from each bucket. However, I don't know how to aggregate over the fields of _source.
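
As a rough illustration of what the query gives me, here is a small Python sketch that mimics the bucketing client-side. The `|` separator, the `version` field, and the helper names are assumptions for the example, not what my index actually uses, and taking the first document seen is only a stand-in for whatever hit `top_hits` would return.

```python
# Sample documents; "version" is a hypothetical extra field standing in for
# the fields that differentiate versions.
docs = [
    {"key": "a", "subkey": 0, "version": 1},
    {"key": "a", "subkey": 0, "version": 2},
    {"key": "a", "subkey": 1, "version": 1},
    {"key": "b", "subkey": 0, "version": 1},
    {"key": "b", "subkey": 0, "version": 2},
]


def concatenated_key(doc, sep="|"):
    """Join key and subkey into one string; the separator is an assumption."""
    return f"{doc['key']}{sep}{doc['subkey']}"


def bucket_with_sample(documents):
    """Group documents by concatenated key, keeping a count and the first document seen."""
    buckets = {}
    for doc in documents:
        # setdefault only stores the new bucket the first time a key appears,
        # so sample_doc stays the first document seen for that key.
        bucket = buckets.setdefault(
            concatenated_key(doc), {"doc_count": 0, "sample_doc": doc}
        )
        bucket["doc_count"] += 1
    return buckets


buckets = bucket_with_sample(docs)
for name, bucket in sorted(buckets.items()):
    print(name, bucket["doc_count"], bucket["sample_doc"])
```

What I'm missing is the next step: running a further aggregation over the fields of those sample documents inside Elasticsearch rather than in client code.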

Would appreciate any ideas. Thanks!

