How to group documents and aggregate on groups

Let's say I have an Elasticsearch index called "item-tag" where each document contains exactly one itemid (long) and exactly one tag (keyword). Multiple documents can share the same itemid, and multiple documents can share the same tag, but every document is unique: the same combination of itemid and tag never appears twice.

So how can I get the number of unique itemids for each pair of tags?
Is the only way to create a new index with one document per itemid, where each document has an array of tags, or is it possible with the current index?

Data in index:

itemid:1 tag:A
itemid:1 tag:B
itemid:1 tag:C
itemid:2 tag:B
itemid:2 tag:C
itemid:2 tag:D
itemid:3 tag:B
itemid:3 tag:D

Wanted output:

  A B C D
A 1 1 1 0
B 1 3 2 2
C 1 2 2 1
D 0 2 1 2

Regards, Markus Paaso

The easiest way to do it is to use a script on the terms aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-script

Something like this:

GET /item-tag/_search
{
    "size": 0,
    "aggs" : {
        "itemid_tag_tuples" : {
            "terms" : {
                "script" : {
                    "source": "doc['itemid'].value + '_' + doc['tag'].value",
                    "lang": "painless"
                }
            }
        }
    }
}

The script will emit the itemid_tag tuple for each document, rather than an individual field. That should give you the total set of tuples.

Scripts are a bit slower, however, since the script has to be evaluated for every document. The fastest way to get this data is to index the tuple into the document directly. You can either do it in your application (just include an extra field), or you could use the copy_to functionality in Elasticsearch to copy both itemid and tag into a single field (which is mapped as a keyword to make it a single token).

That'll give faster lookups because Elasticsearch just has to group the existing tokens, rather than execute the script.

That said, I'd use the script to maintain flexibility, and only index the tuple directly if you really need the speed.
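As a rough illustration of the "do it in your application" option above, here is a minimal Python sketch that adds the combined value to each document before it is indexed. The field name `itemid_tag` and the helper `with_tuple_field` are made up for this example; the sketch only builds the document bodies and does not talk to Elasticsearch.

```python
def with_tuple_field(doc, field_name="itemid_tag"):
    """Return a copy of doc with an extra field holding the itemid_tag tuple.

    field_name is a hypothetical name; the field would need to be mapped
    as a keyword so the whole value is kept as a single token.
    """
    enriched = dict(doc)  # copy, so the caller's document is left unchanged
    enriched[field_name] = f"{doc['itemid']}_{doc['tag']}"
    return enriched


# Two of the sample documents from the question.
docs = [{"itemid": 1, "tag": "A"}, {"itemid": 2, "tag": "B"}]
enriched_docs = [with_tuple_field(d) for d in docs]
```

With such a field in place, a plain terms aggregation on it would produce the same keys the script emits, without evaluating a script per document.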

Thanks for the fast answer.

I didn't mean to get itemid - tag tuples, but tag - tag tuples with the count of unique itemids for each.

So the result should look like:

[
    {"tag1":"B", "tag2":"C", "itemid_count":2},
    {"tag1":"C", "tag2":"D", "itemid_count":1},
    ...
] 

In my actual index there are 170000 different itemids but only 12 different tags.
Each itemid has an average of 2.5 tags, so in total there are 425000 different itemid - tag pairs.
As there are only 12 different tags, the results should consist of 12 * 12 = 144 itemid counts.

I hope the question is clearer now.

I don't think there is a single query that can generate data in the format specified above. However, this can be achieved by doing some manipulation on the responses.

You can either use the query referred to by @polyfractal above, or you can nest one aggregation inside another (an aggregation within an aggregation, not the "nested" aggregation, which is a special type of aggregation in ES). I am listing an approach based on that below.

{
    "aggs": {
        "ITEMID": {
            "terms": {
                "field": "itemid",
                "size": 100
            },
            "aggs": {
                "TAG": {
                    "terms": {
                        "field": "tag",
                        "size": 100
                    }
                }
            }
        }
    }
}

Based on the response, you can build the matrix on the client side by iterating over the buckets. The response lists the tags that exist for each itemid, so with an appropriate data structure (a dictionary / hash table) it can easily be converted into the desired format.
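The client-side step could look roughly like the Python sketch below. It assumes the response has the usual terms-aggregation shape (`aggregations` -> `ITEMID` -> `buckets`, each with a `TAG` sub-aggregation) and uses a hand-written `sample_response` built from the example data in the question instead of a real search call. The function names are made up for this example.

```python
from itertools import combinations_with_replacement


def cooccurrence_counts(response):
    """Count, for every tag pair, how many itemid buckets contain both tags."""
    counts = {}
    for item_bucket in response["aggregations"]["ITEMID"]["buckets"]:
        # Tags are unique per itemid, so each pair is counted once per item.
        tags = sorted(b["key"] for b in item_bucket["TAG"]["buckets"])
        # "with_replacement" also yields (tag, tag), i.e. the matrix diagonal.
        for pair in combinations_with_replacement(tags, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def as_rows(counts):
    """Convert the pair counts into the list format asked for in the question."""
    return [
        {"tag1": tag1, "tag2": tag2, "itemid_count": n}
        for (tag1, tag2), n in sorted(counts.items())
    ]


def _bucket(itemid, tags):
    # Helper that mimics one ITEMID bucket of the assumed response shape.
    return {
        "key": itemid,
        "doc_count": len(tags),
        "TAG": {"buckets": [{"key": t, "doc_count": 1} for t in tags]},
    }


# Hand-written stand-in for the search response, using the sample data.
sample_response = {
    "aggregations": {
        "ITEMID": {
            "buckets": [
                _bucket(1, ["A", "B", "C"]),
                _bucket(2, ["B", "C", "D"]),
                _bucket(3, ["B", "D"]),
            ]
        }
    }
}

counts = cooccurrence_counts(sample_response)
rows = as_rows(counts)
```

Only one half of the symmetric matrix is stored (tag1 <= tag2), and pairs that never occur together, such as A and D, are simply absent. Note that the query above requests at most 100 itemid buckets, so for an index with 170000 itemids the size would have to be raised or the buckets fetched in pages before the counts are complete.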

Hope this answers your question.
