Hello,
I need to run an aggregation across multiple indexes. The indexes are independent of one another, and each may contain millions of documents. Each index has its own IDs, but every document has a hash property that can be used to identify duplicate items across indexes. The documents look like this:
On index 1
{ "id": 1, "hash": "abcdefg", "title": "the title 1", "category": "the category 1", "createdBy": "user", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 2, "hash": "asdfjklñ", "title": "the title 2", "category": "the category 1", "createdBy": "user2", "modifiedDate": "2011-04-11T10:20:30Z" }
On index 2
{ "id": 2, "hash": "fghijk", "title": "the title 2", "category": "the category 2", "createdBy": "user3", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 3, "hash": "abcdefg", "title": "the title 1", "category": "the category 1", "createdBy": "user", "modifiedDate": "2011-04-11T10:20:30Z" }
{ "id": 4, "hash": "lmnopq", "title": "the title 3", "category": "the category 3", "createdBy": "user2", "modifiedDate": "2011-04-11T10:20:30Z" }
I need to get the list of unique categories, each with the count of unique documents (identified by hash) across the indexes. Merging everything into one big index is not an option for me. The result I am looking for is something like this:
"category 1": 2
"category 2": 1
"category 3": 1
I am using Elasticsearch 2.4.
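To make the expected result above concrete, here is a minimal Python sketch (plain in-memory code, not an Elasticsearch query) that computes it from the sample documents. The function name `unique_counts_by_category` and the variables `index_1` and `index_2` are made up for illustration, and the documents are reduced to the two fields that matter.

```python
from collections import defaultdict

# Sample documents from the post, reduced to hash and category.
index_1 = [
    {"hash": "abcdefg", "category": "the category 1"},
    {"hash": "asdfjklñ", "category": "the category 1"},
]
index_2 = [
    {"hash": "fghijk", "category": "the category 2"},
    {"hash": "abcdefg", "category": "the category 1"},
    {"hash": "lmnopq", "category": "the category 3"},
]


def unique_counts_by_category(*indexes):
    """Count distinct hashes per category across all given indexes."""
    hashes = defaultdict(set)
    for index in indexes:
        for doc in index:
            # A set drops the duplicate hash that appears in both indexes.
            hashes[doc["category"]].add(doc["hash"])
    return {category: len(seen) for category, seen in hashes.items()}


counts = unique_counts_by_category(index_1, index_2)
print(counts)
```

Counting documents instead of distinct hashes would give 3 for the first category, because the document with hash "abcdefg" exists in both indexes. That is the duplicate I want to exclude.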
I tried an aggregation like the following:
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category"
      }
    }
  }
}
But this returns counts that include the duplicates. I also tried the following aggregation:
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "hash",
        "min_doc_count": 1
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        }
      }
    }
  }
}
But this aggregation returns one bucket per hash, and each bucket contains only the category for that hash.
Any idea how I can achieve this? Is it even possible?
Thanks in advance,
Manuel