I'm trying to get a histogram (not a date histogram) of the number of times our users logged in during a given time window. We store the username in every document related to that user's activity, so I can get part of the way there with a query restricted to the right time window and a terms aggregation on the username field.
Where it gets tricky is that, because we have a lot of users (I'm talking tens of thousands, and I don't want to graph a histogram bar for each of them, for obvious reasons), I'm trying to summarise the results with a histogram from the results of the terms aggregations.
So far I have tried a few things I found here and there on the web, but since I couldn't find many people with a similar need, and given my limited knowledge of the more advanced parts of ES's search API, I haven't found a way to do this. Would anyone have an idea?
OK. I'm assuming a user's "number of connections" can be derived from the number of events of the type "successful log-in" or similar.
The final result is tiny - let's say a bar chart with a hundred bars each with a single integer.
However, the interim state required to pull this summary together from your event store could be massive - each unique user needs to have a count of all successful logins returned to the reducing node just to figure out that user X belongs in the bar representing the users with between 0 and 10 connections. With "tens of thousands of users" to consider that's a lot but might just about be manageable.
I expect you'd need to have some client-side code though to collapse the raw data into the final histogram.
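As a rough sketch of that client-side step, assuming Python on the client: the first function builds a request body with a time-window filter and a terms aggregation, and the second collapses the returned buckets into "number of users per login-count range". The field names `event_type`, `@timestamp` and `username`, the value `login_success`, and the aggregation name `logins_per_user` are placeholders I've made up, not anything from your mapping. The terms `size` has to be at least the number of distinct users, or the tail gets cut off.

```python
from collections import Counter


def build_login_count_request(start, end, max_users=50000):
    """Request body: successful logins in [start, end), counted per username.

    All field names and the event type value are hypothetical placeholders.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event_type": "login_success"}},
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "logins_per_user": {
                "terms": {"field": "username", "size": max_users}
            }
        },
    }


def collapse_to_histogram(buckets, interval=10):
    """Turn terms buckets ({"key": ..., "doc_count": ...}) into a histogram.

    Keys of the result are the lower bound of each range (0, 10, 20, ...),
    values are the number of users whose login count falls in that range.
    """
    histogram = Counter()
    for bucket in buckets:
        lower_bound = (bucket["doc_count"] // interval) * interval
        histogram[lower_bound] += 1
    return dict(sorted(histogram.items()))
```

The `buckets` argument would be the list found under `aggregations.logins_per_user.buckets` in the search response. Empty ranges are simply absent from the result, so the charting code needs to fill those in if it wants a continuous axis.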
If you had an entity-centric index that maintained a single doc per user, with an array of their login dates, it might be possible to do this in one request. The query would select users with at least one log-in during the required time period. A histogram agg with a script would then rip through each user's array of login dates, count those falling in the time period, and return that count as the value used to assign the user to a bucket (e.g. those users with 0-10 logins).
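To make the per-document logic concrete, here is a plain-Python model of what such a script would have to compute. This is not Elasticsearch script syntax and it doesn't talk to a cluster; it only mimics the counting and bucketing on in-memory docs. The doc shape (`username`, `login_dates`) and the function names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def logins_in_window(login_dates, start, end):
    """Count the login dates that fall in the half-open window [start, end)."""
    return sum(1 for d in login_dates if start <= d < end)


def histogram_from_user_docs(user_docs, start, end, interval=10):
    """Model of the single-request approach over one-doc-per-user data.

    Each doc is assumed to look like {"username": ..., "login_dates": [...]}.
    Users with no login in the window are skipped, which stands in for the
    query that selects only users with at least one login in the period.
    """
    histogram = Counter()
    for doc in user_docs:
        count = logins_in_window(doc["login_dates"], start, end)
        if count == 0:
            continue
        # The count is the value that decides which bucket the user lands in.
        histogram[(count // interval) * interval] += 1
    return dict(sorted(histogram.items()))
```

The point of the model is that the work per user is small (one pass over that user's dates), and only the final bucket counts need to leave each shard, rather than one count per user.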