I'm using nested aggregations to find duplicates in a nested array. I expect only a few duplicates (say 3 out of 500 possible values across 10,000 documents). The response to my query includes 500 buckets (one for each possible value), but 497 of them contain empty sub-buckets (meaning those property names aren't duplicated). Is there a way to return only the 3 buckets that contain sub-buckets with data?
I'm not sure I explained that very well...
Here's my query (courtesy of @dantuff):
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "nested_properties": {
      "nested": {
        "path": "properties"
      },
      "aggs": {
        "display_name": {
          "terms": {
            "field": "properties.display.raw",
            "size": 1000000
          },
          "aggs": {
            "doc_id": {
              "terms": {
                "field": "_uid",
                "min_doc_count": 2,
                "size": 1000000
              },
              "aggs": {
                "properties_to_elements": {
                  "reverse_nested": {},
                  "aggs": {
                    "element_id": {
                      "top_hits": {
                        "_source": {
                          "include": [ "system.id", "system.sourceid" ]
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
And here are the first two top-level buckets of the response. Notice that the first bucket (key: "Length") contains a doc_id.buckets[] with data in it, whereas the second bucket (key: "Family Name") contains an empty doc_id.buckets[].
  "aggregations": {
    "nested_properties": {
      "doc_count": 16487,
      "display_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Length",
            "doc_count": 391,
            "doc_id": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "building-elements#AU3oVn402bOX65T_n7uZ",
                  "doc_count": 2,
                  "properties_to_elements": {
                    "doc_count": 1,
                    "element_id": {
                      "hits": {
                        "total": 1,
                        "max_score": 1,
                        "hits": [
                          {
                            "_index": "5",
                            "_type": "building-elements",
                            "_id": "AU3oVn402bOX65T_n7uZ",
                            "_score": 1,
                            "_source": {
                              "system": {
                                "id": 263,
                                "sourceid": "f02ec989-d9a7-4c4b-bcc3-749786b1bfc1-00044f1a"
                              }
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }
          },
          {
            "key": "Family Name",
            "doc_count": 382,
            "doc_id": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": []
            }
          },
          ....
          LOTS OF OTHER EMPTY BUCKETS DELETED
          ....
        ]
      }
    }
  }
}
Is there a way to have ES return only the buckets whose doc_id.buckets array is non-empty?
If not, is there a way to sort the buckets so that those with a non-empty doc_id.buckets array come first?
Many thanks,
John H.