Hi say I have two fields "catA" and "catB" respectively indexed. A has possible values [ "1" ,"2", "n" ] and B has possible values ["i" ,"ii" ,-"n"] within the document set. I want to be able to know for what values of catB , the count of distinct catA values is >1
What you are looking for will be a combination of Aggregations, like this (in pseudocode, I don't have an elastic instance handy sorry):
A terms aggregation on "catB" to 'group by' those values, with a sub-aggregation of type 'value_count' to count the 'catA' values within the enclosing catB.
Then, as a sibling agg to the 'terms' you'll want a 'bucket_selector' to filter the results...
So at a rough stab, I think it would be something like this... Apologies if it doesn't run, just doing it off the top of my head!
Thanks so much for the answer . Its almost there.. I can get this to work
"aggs" : {
"summary" : {
"terms" : { "field" : "catB", "size" : "200"},
"aggs": {
"catAcount" : { "cardinality" : { "field" : "catA" } }
}
}}}
Which changes the "value_count" to "cardinality" this will tell me the number of values for catA in each document that contains the CatB field. I could not get the bucket aggregator to work tho. Something to do with the brackets as I am getting errors about sibling queries not allowed etc. If anyone has anything to offer on getting the above working with the bucket aggregator so as to get a list of the catA values it would be great... either way thanks for the help. Im halfway there
Hello mate -sorry my first attempt didn't work! Looking at it, perhaps I put the 'bucket_selector' in the wrong spot? What if it is shifted to be contained by the 'terms' structure - i.e. like this:
(I should note here, I haven't used the bucket selector before, but I do tend to use other pipeline aggs a fair bit, so I think it should work ok with the right tweak
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "Aggregation definition for [having starts with a [VALUE_STRING], expected a [START_OBJECT].",
Thanks for trying again, alas it doesnt work though. I have twiddled with it little myself but no luck so far.. thanks again.
Ha ha sorry been out grabbing a coffee - you got it working though? Sorry if my syntax was off... Getting these aggs working right can be finicky - I usually seem to need a few goes to get it right
I know this query is marked complete but I don have an additional related question which I think would be misplaced in a new post. So here goes. If the final code is like this:
"summary" : {
"terms" : { "field" : "catB", "size" : "200"},
"aggs": {
"catAcount" : { "value_count" : { "field" : "catA" } },
"having" : "bucket_selector": {
"buckets_path": {
"catAcounts": "catAcount"
},
"script": "params.catAcounts > 1"
}
}
}
}
how do I do a sub aggregation on the buckets? as in where do I nest the sub aggregation ? I found the documentation particularly vague and nebulous on this matter... anyway if you have any idea it would be great. Also thanks a million for the help.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.