Query with two aggregations


I am tying to write a query to first get the last state value per customer, and then counting those up.

So I have logs of customers with states. Each time a state of a customer changes, I write this to a log file. And through a query I want to be able to count how many of customer is a "available" state at the moment.

It should not count the customer, that were available once and not available any longer OR should not count the customers twice if an available customer became unavailable first and then available again.

Thanks in advance!

This potentially expensive to compute if you have to consider every customer as an individual, determining the latest status for each.
Would it perhaps be possible to compute a "good-enough" stat by subtracting counts of the various states? e.g.

numAvailable = totalNumAvailableStatesRecorded - totalNumUnavailableStatesRecorded

This would only work if each customer only ever has 2 states recorded - available then unavailable.

@Mark_Harwood Thank you for the answer. My data has more than 2 states.

You said it is expensive to compute, does this mean it is possible?? I am at the moment interested in getting the exact accurate result rather than having performance issues.

My data has more than 2 states.

If the states are mutually exclusive and recorded once per customer then that sort of arithmetic may still be accurate.

You said it is expensive to compute, does this mean it is possible??

Yes, but may require you to re-think index design due to processing costs of performing distributed joins. See entity centric indexing: https://www.youtube.com/watch?v=yBf7oeJKH2Y

Other people seem to have different understandings of what you're asking, but from my understanding (log lines relating customer <-> state, and you're interested in the latest such for each customer) you start by using a terms aggregation to sort the log lines into per-customer buckets, and then you use a top hits aggregation to get the latest state for each customer. So far so standard, and works fine - it can't, I hope, be very difficult to work out how to add a third aggregation to count how many you've then got in each state? Another terms aggregation into buckets by state?

@TimWard. @Mark_Harwood to be more precise;

Here are the two screenshots of my data representation:

The first is the logs for the customers, their states and the date the log is written.

I would like to be able to count the number of customers they are currently in greens, reds and yellows state.

In this example, when I query red, I should get 1 since Ingolstadt CC turned to green later OR when I query for green, the reult should be 8 AND for yellow, the result should be 2.

So I thought about writing a query. I have only used the Kibana editor so far for my visualizations. Therefore asking if this would be possible by writing a query.

Thank you!

OK so it looks like the state transitions aren't always in the same sequence so my arithmetic suggestion wouldn't work. You need to consider each customer individually.

If you have many (eg hundreds of thousands plus) of customers then this causes the "bucket explosion" problem referenced in my video link.
This will be too much work to attempt in a single query. You can try use term partitioning to break the analysis into several smaller queries. The docs even include an example of how to get the latest value for each ID. This will not play nicely with Kibana though.
Using the entity centric example I gave the cons are a more complex indexing strategy but the pros are faster, simpler queries and Kibana-compatible indexes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.