I would like to obtain a list of special_id s, where there are no documents with "phase" = "create". Is there a way to obtain such list without external scripts?
Thanks!
Aba
What I don't know how to do, is to extract the values, where "doc_count" = 0 for the "key" = "create". I think it would be enough if I could sort by "All_IDs.phasetype.doc_count", but I have no idea how.
There is still a problem with this approach. I want to have only those values in the result, where the doc_count = 0, but this one still prints everything. See below result:
Here is a minimal repeatable example. This will filter out special_id buckets based on the count of values that match the create phase. Specifically, when count is 0
Terms aggs supports the include parameter. This is much more performant than adding your own top level filter if you are only filtering on those term values
bucket_selector is a sub agg for the ALL_IDs and is filtering based on the sub-agg within the phase_createaggregation
Honestly, it seems to me that the _count path should work with filters aggregations. I am not 100% sure why it doesn't but that smells like a bug to me.
I removed the "phaseUpdate" filter, so I can use the _count and I added a second bucket_selector to filter out the empty buckets.
Anyhow, your solution looks better. Thank you very much for all your help!
Now the only problem which remains to be solved is to use this _search on all the IDs (there are approximately 500000), and all the documents (approx 10million documents). I think I should split it somehow by special_id intervals. The only issue is, that special_id is really special. It is keyword .
Any system that tries to get a holistic view of each entity is going to struggle if the related data for each entity is distributed somewhat randomly across multiple machines.
The terms aggregation is useful for grouping together related items under a common key but will fail to do so reliably if there are very large numbers of keys spread across multiple machines. Not all data nodes may agree exactly on which keys to gather for final analysis on the search coordinating node, meaning some data may be missing for keys presented in the final result.
Your sort of behavioural analysis looks like a classic use case for an entity centric index created using the transforms api.
My data is not yet distributed across multiple machines. This is index has only 1 shard.
GET _cat/shards/restored_prod_event_store20210729?v
index shard prirep state docs store ip node
my_index 0 p STARTED 450548050 37.9gb 123.123.123.123 test-node-3
Sidenote: The doc count is 450548050, however the number of documents I am interested in is actually smaller. (I am also filtering based on another field.)
See below result for the search call written above in the earlier posts:
But I understand what you are saying, the "relation" : "gte"is clearly not ok for me, and the "sum_other_doc_count" : 9846594 is also way too high. I will review the transforms api as you suggested, and I hope I can figure something out. Any other tips which can help me on the way are very much appreciated. Thank you for your help!
What you may find is that by transforming low-level log records into entities you can do a lot more interesting analysis e.g. average length of time taken for different phases, max number of phase changes per entity etc.
Derived entity attributes like "duration" or "lastKnownState" are typically useful things the transform API can help with. It shifts the computational cost and complexity of doing this sort of analysis from query-time to index-time.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.