I am tying to write a query to first get the last state value per customer, and then counting those up.
So I have logs of customers with states. Each time a state of a customer changes, I write this to a log file. And through a query I want to be able to count how many of customer is a "available" state at the moment.
It should not count the customer, that were available once and not available any longer OR should not count the customers twice if an available customer became unavailable first and then available again.
This potentially expensive to compute if you have to consider every customer as an individual, determining the latest status for each.
Would it perhaps be possible to compute a "good-enough" stat by subtracting counts of the various states? e.g.
@Mark_Harwood Thank you for the answer. My data has more than 2 states.
You said it is expensive to compute, does this mean it is possible?? I am at the moment interested in getting the exact accurate result rather than having performance issues.
If the states are mutually exclusive and recorded once per customer then that sort of arithmetic may still be accurate.
You said it is expensive to compute, does this mean it is possible??
Yes, but may require you to re-think index design due to processing costs of performing distributed joins. See entity centric indexing: https://www.youtube.com/watch?v=yBf7oeJKH2Y
Other people seem to have different understandings of what you're asking, but from my understanding (log lines relating customer <-> state, and you're interested in the latest such for each customer) you start by using a terms aggregation to sort the log lines into per-customer buckets, and then you use a top hits aggregation to get the latest state for each customer. So far so standard, and works fine - it can't, I hope, be very difficult to work out how to add a third aggregation to count how many you've then got in each state? Another terms aggregation into buckets by state?
I would like to be able to count the number of customers they are currently in greens, reds and yellows state.
In this example, when I query red, I should get 1 since Ingolstadt CC turned to green later OR when I query for green, the reult should be 8 AND for yellow, the result should be 2.
So I thought about writing a query. I have only used the Kibana editor so far for my visualizations. Therefore asking if this would be possible by writing a query.
OK so it looks like the state transitions aren't always in the same sequence so my arithmetic suggestion wouldn't work. You need to consider each customer individually.
If you have many (eg hundreds of thousands plus) of customers then this causes the "bucket explosion" problem referenced in my video link.
This will be too much work to attempt in a single query. You can try use term partitioning to break the analysis into several smaller queries. The docs even include an example of how to get the latest value for each ID. This will not play nicely with Kibana though.
Using the entity centric example I gave the cons are a more complex indexing strategy but the pros are faster, simpler queries and Kibana-compatible indexes
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.