The message field holds various messages, and there are multiple messages per idNumber.
I would like to find the idNumbers for which there are two or more documents with message equal to "Hello" within one hour. How can I achieve that (if it's possible at all)?
I'm sure there's a better way of formatting the date, but essentially I combine the ID and a string representation of an hour-level bucket. Obviously this doesn't spot two events that straddle an hour boundary, e.g. 12:59 and 13:01, but it might be good enough. If you have many unique IDs and they are spread across multiple shards, then this will not scale, and you will need to think about indexing approaches that bring related data closer together.
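To make the idea concrete, here is a rough sketch of the bucketing logic in plain Python rather than as an actual aggregation. The `timestamp` field name, the sample documents and the helper names are all made up for illustration; only `idNumber` and `message` come from the question.

```python
from collections import Counter
from datetime import datetime


def hour_bucket_key(id_number, timestamp):
    # Combine the ID with a string form of the hour the event falls in.
    return (id_number, timestamp.strftime("%Y-%m-%dT%H"))


def ids_with_repeats(docs, message="Hello", min_count=2):
    # Count matching documents per (idNumber, hour bucket) key.
    counts = Counter(
        hour_bucket_key(d["idNumber"], d["timestamp"])
        for d in docs
        if d["message"] == message
    )
    # Keep the IDs that have at least min_count hits in any single bucket.
    return sorted({key[0] for key, n in counts.items() if n >= min_count})


# Hypothetical sample data; "timestamp" is an assumed field name.
docs = [
    {"idNumber": "A", "message": "Hello", "timestamp": datetime(2020, 1, 1, 12, 10)},
    {"idNumber": "A", "message": "Hello", "timestamp": datetime(2020, 1, 1, 12, 40)},
    {"idNumber": "B", "message": "Hello", "timestamp": datetime(2020, 1, 1, 12, 59)},
    {"idNumber": "B", "message": "Hello", "timestamp": datetime(2020, 1, 1, 13, 1)},
    {"idNumber": "C", "message": "Hello", "timestamp": datetime(2020, 1, 1, 12, 5)},
    {"idNumber": "C", "message": "Goodbye", "timestamp": datetime(2020, 1, 1, 12, 30)},
]

result = ids_with_repeats(docs)
print(result)
```

In this sample, "B" has two "Hello" messages only two minutes apart, but they land in different hour buckets, so it is not reported. That is the boundary limitation mentioned above.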
Thanks, Mark. This is for reporting purposes, so speed is not so much of a requirement and the results can be approximate as long as I can get a good sample of occurrences. I'll try that out.