For each log entry, I have "user identifier" and "date" fields. I would like to find out the number of unique dates on which each "user identifier" appears.
E.g., for a given user identifier "1234567", if it appeared in log entries on 5 different dates, then I would like the histogram to show "1234567": 5
If I have a million user identifiers then I would have a million counts to compute.
Is this possible? I ran into an error saying that I exceeded the bucket limit of 10000. I foresee that this problem will not be solved just by increasing the bucket limit.
You can do this two ways. FYI, the bucket limit is 65k in the most recent versions of the stack.
You can use a terms aggregation on the user ID, with a cardinality sub-aggregation on the date field. Because date values are stored as milliseconds since the epoch, the cardinality counts distinct millisecond timestamps, not distinct calendar days. You could use a scripted cardinality instead if you want to round to the day, but that is the slowest calculation.
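A rough sketch of what that request body could look like, built as a plain Python dict so it can be printed and pasted into a search request. The field names `user_id` and `date`, the aggregation names `users` and `unique_dates`, and the `size` of 1000 are assumptions for illustration, not taken from your mapping. The small `count_unique` helper is only a local illustration of the millisecond caveat, not something Elasticsearch runs.

```python
import json
from collections import defaultdict

MS_PER_DAY = 86_400_000  # milliseconds in one UTC day


def unique_dates_per_user_body(user_field="user_id", date_field="date", size=1000):
    """Build a search body: one terms bucket per user, each with a
    cardinality sub-aggregation over the date field (field names assumed)."""
    return {
        "size": 0,  # return aggregations only, no hits
        "aggs": {
            "users": {
                "terms": {"field": user_field, "size": size},
                "aggs": {
                    "unique_dates": {"cardinality": {"field": date_field}}
                },
            }
        },
    }


def count_unique(entries, round_to_day):
    """Local reference: exact distinct count per user, either on raw
    millisecond timestamps or on timestamps rounded down to the UTC day."""
    seen = defaultdict(set)
    for user, ts_ms in entries:
        seen[user].add(ts_ms // MS_PER_DAY if round_to_day else ts_ms)
    return {user: len(values) for user, values in seen.items()}


body = unique_dates_per_user_body()
print(json.dumps(body, indent=2))

# Two entries on the same day (1 second apart) and one on the next day.
sample = [("1234567", 0), ("1234567", 1_000), ("1234567", MS_PER_DAY)]
raw_counts = count_unique(sample, round_to_day=False)
day_counts = count_unique(sample, round_to_day=True)
print(raw_counts, day_counts)
```

On the sample data the raw count is 3 while the per-day count is 2, which is the difference between matching milliseconds and matching dates. Keep in mind that the cardinality aggregation itself is approximate, so the real counts can be slightly off for large sets.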
Thanks, the first method worked for me. I noticed that the number of user IDs returned depends on the "size" parameter of the terms aggregation, and increasing that value causes me to exceed the bucket limit. Do you recommend increasing the max bucket limit whenever I need more results?
That's how the terms aggregation works: it shows the most frequently occurring values, but your dataset seems to be equally weighted, so it is not a good fit. The transform option can calculate uniqueness for all values.
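For the transform route, a pivot configuration along these lines might be a starting point. This is only a sketch: the index names `logs` and `user_unique_dates` and the field names `user_id` and `date` are hypothetical, and the body would be sent when creating a transform. It groups by user and stores one document per user with the cardinality of the date field, so nothing is limited by the terms "size" parameter.

```python
import json


def unique_dates_transform_body(source_index="logs",
                                dest_index="user_unique_dates",
                                user_field="user_id",
                                date_field="date"):
    """Build a pivot transform body: one output document per user,
    holding the cardinality of the date field (all names assumed)."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            # One group, and therefore one destination document, per user.
            "group_by": {
                user_field: {"terms": {"field": user_field}}
            },
            # Same cardinality calculation as in the search aggregation.
            "aggregations": {
                "unique_dates": {"cardinality": {"field": date_field}}
            },
        },
    }


transform_body = unique_dates_transform_body()
print(json.dumps(transform_body, indent=2))
```

The same millisecond caveat applies here: the cardinality is computed on the raw date values unless they are rounded first.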