I use a filter query to select only documents that match certain ids in a list. I would like to get back as a metric the number of ids that matched, but if I use a cardinality aggregation I get the cardinality of all the ids in the matching documents. Example:
Let's say I have 2 documents. Document A has ids = ['000', '111', '222'] and document B has ids = ['111', '222'].
By filtering for ids = ['000'], only document A will match the query, and executing a cardinality aggregation over ids gives a result of 3. I would like to get back only the number of ids matching the document, in this case 1.
I tried a scripted metric aggregation, but with no luck: I get back only one value per shard instead of one per document.
You can let Elasticsearch compute this without having to resort to complicated scripting, by making use of scoring.
If instead of using a terms query, you take each ID and use a term query for that ID, wrapped in a constant_score query, you will get a score of 0 or 1 depending on whether that ID is a match for a given document. You can then combine all these term queries for all the IDs that you are interested in inside of a bool query's should clause, like this:
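A minimal sketch of such a query, using the ids field and the three ID values from the example above (each constant_score clause contributes its default boost of 1.0 to the score when its term matches):

```json
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "filter": { "term": { "ids": "000" } } } },
        { "constant_score": { "filter": { "term": { "ids": "111" } } } },
        { "constant_score": { "filter": { "term": { "ids": "222" } } } }
      ]
    }
  }
}
```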
The score of each document will now be the number of IDs that match for that document.
If you want to know the total number of IDs that matched across all documents, you can combine this query with a sum aggregation that sums all of the individual scores:
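A sketch of the combined request; the aggregation name total_matching_ids is just illustrative:

```json
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "filter": { "term": { "ids": "000" } } } },
        { "constant_score": { "filter": { "term": { "ids": "111" } } } },
        { "constant_score": { "filter": { "term": { "ids": "222" } } } }
      ]
    }
  },
  "aggs": {
    "total_matching_ids": {
      "sum": {
        "script": { "source": "_score" }
      }
    }
  }
}
```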
I have multiple documents with the same ref_id but different values for other fields (e.g. timestamp). Because of that, the score I get back is not the number of unique ids for each reference id but the sum of all the matches.
So instead of getting back 1 or 2 per ref_id, I get back the sum of all the matches across documents: essentially how many times the target_ids are seen in total. I want them counted distinctly; once an id has been seen, it should not be counted again.
I did some more work on the script and finally got it working, but I soon ran into the limitation that a script cannot be more than 16k characters long, and I want to have lists of target_ids with 10k entries or more.
I was thinking about another solution: what if I compute the intersection later and get back from the aggregation the unique list of target_ids per ref_id bucket? I can make the inner aggregation a terms aggregation and get back the matching target_ids, but it will be restricted by the size I choose. Something like the sketch below.
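A sketch of that aggregation, assuming fields named ref_id and target_ids as above; the include list restricts the inner terms aggregation to the target ids of interest, and the size values are illustrative:

```json
{
  "size": 0,
  "aggs": {
    "per_ref_id": {
      "terms": { "field": "ref_id", "size": 1000 },
      "aggs": {
        "matched_target_ids": {
          "terms": {
            "field": "target_ids",
            "include": ["000", "111", "222"],
            "size": 10000
          }
        }
      }
    }
  }
}
```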
I am only interested in the actual size of the target_ids terms aggregation. I understand that doing this from a custom application is trivial (just count the keys in the bucket), but I am trying to plug this into Kibana to get back a sort of cardinality count restricted to the values I pass in the filter and include.
Is there any way to get back from Elasticsearch the size of the target_ids terms aggregation instead of the terms aggregation results?
I actually created a new metric in Kibana to run the restricted unique count on the passed filter, but it seems that Kibana can only handle metrics that return a single value per bucket, not another aggregation. Another solution would be for Kibana to handle responses that are not just a single value but a more complex JSON.
But this is getting beyond the main scope of my question.
However, the results cannot be sorted by _bucket_count; this has the side effect that some ref_ids with a higher _bucket_count will be cut out by ref_ids with a higher doc_count.
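For reference, the bucket count can be surfaced as a per-bucket value with a bucket_script pipeline aggregation via the special _bucket_count buckets_path; a sketch reusing the field names above, keeping in mind that the parent terms aggregation still orders and truncates its buckets by doc_count, which is the limitation described:

```json
{
  "aggs": {
    "per_ref_id": {
      "terms": { "field": "ref_id" },
      "aggs": {
        "matched_target_ids": {
          "terms": {
            "field": "target_ids",
            "include": ["000", "111", "222"]
          }
        },
        "unique_match_count": {
          "bucket_script": {
            "buckets_path": { "count": "matched_target_ids._bucket_count" },
            "script": "params.count"
          }
        }
      }
    }
  }
}
```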
I also tried scripted_metric; although I was able to achieve the functionality I wanted, the script length limit of 16384 characters will not allow this to work in a real use case.
Does anyone have a better idea how to solve this problem?