I have an index with nested objects and am trying to derive some statistics about the data, but am having trouble doing so.
The documents look like this simplified version:
{
"rn": 1,
"t": "X",
"a": [
{
"r": [
{
"rt": "A",
"p": [
{
"sn": 1
},
{
"sn": 2
}
]
},
{
"rt": "B",
"p": [
{
"sn": 1
},
{
"sn": 2
}
]
},
{
"rt": "C",
"p": [
{
"sn": 1
},
{
"sn": 2
}
]
}
]
}
]
}
{
"rn": 2,
"t": "X",
"a": [
{
"r": [
{
"rt": "A",
"p": [
{
"sn": 1
},
{
"sn": 2
}
]
},
{
"rt": "B",
"p": [
{
"sn": 1
},
{
"sn": 2
}
]
}
]
}
]
}
The arrays, a, r and p, are mapped as nested objects.
I want to write an aggregation to return:
- The maximum number of occurrences of r in a document (3 in this example)
- The average number of occurrences of r in a document (2.5 in this example)
- Determine the above two within the buckets for terms in t (just X in this example, but there will be a dozen or so other terms; I can just run a query a dozen times with different values for a filter on t if needed)
In a non-nested version of this, I used a painless script to count the elements in each document and ran stats over the result. But now, since r is nested, I can't just count the elements in r because r is now split into separate document. I think I need to have an agg that counts the nested documents per document, then another agg to run stats over those counts. But I don't see how to do that. Do I need to basically make a bucket for each document? There are nearly 1B documents in the index.
I am having trouble even working out which aggs I can use, and in what order. I have tried variations of value_count, max_buckets, reverse_agg and a number of others, without success and am clearly missing some understanding of how this agg should work.
Any help would be greatly appreciated.