I'd like to search the last 5 minutes for values of the err_msg field that occurred for the first time ever, and repeat this search every 5 minutes, so it should be as efficient as possible.
In a cluster with time-based indices and lots of potential error types this will be hard. A “new” index has no visibility of the content in old indices, and vice versa.
err_msg is a keyword field, and it holds only the first 160 chars of the original error message (.slice(0, 160)). After running a stack for more than a month, I got fewer than 20 different err_msg values with that query:
In that case, something like this might work. This finds the first uses of tags on StackOverflow (note there are thousands of tags, so I limit them in this example using the include param).
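Adapted to the err_msg case, the same idea could look roughly like the sketch below. It is not the StackOverflow example itself, only an illustration: a terms aggregation on err_msg, a min sub-aggregation for the first time each value was seen, and a bucket_selector that drops values first seen before the cutoff. The timestamp field name `@timestamp`, the aggregation names and the helper `build_first_seen_query` are assumptions, and passing the cutoff through the script's `params` is how I would expect it to work rather than something confirmed in this thread.

```python
import json


def build_first_seen_query(now_ms, window_ms=5 * 60 * 1000, max_terms=100):
    """Build a search body that keeps only err_msg values first seen in the window.

    now_ms and the cutoff are epoch milliseconds; a min aggregation on a date
    field reports its value in the same unit.
    """
    cutoff_ms = now_ms - window_ms
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "by_err_msg": {
                # one bucket per distinct error message (fewer than 20 here)
                "terms": {"field": "err_msg", "size": max_terms},
                "aggs": {
                    # earliest occurrence of this message across all indices
                    "first_seen": {"min": {"field": "@timestamp"}},  # assumed field name
                    # keep the bucket only if that earliest occurrence is recent
                    "only_new": {
                        "bucket_selector": {
                            "buckets_path": {"first_seen": "first_seen"},
                            "script": {
                                "source": "params.first_seen >= params.cutoff",
                                "params": {"cutoff": cutoff_ms},
                            },
                        }
                    },
                },
            }
        },
    }


body = build_first_seen_query(now_ms=1_700_000_000_000)
print(json.dumps(body, indent=2))
```

Note that this has no time range in the query part, so it touches every index each time it runs. That is the price of asking "first time ever", and it is only reasonable while the number of distinct err_msg values stays small.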
I was still wondering if we could instead have a "2-level" query, like what I posted originally, but written as one query. The first level would query very recent errors in the last 5m; the second level would then look for a possible earlier match for those errors, before now-5m. This seems more scalable to me, since most of the time there are no errors in the last 5m, and even the second-level search can be efficient.
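To make the "2-level" idea concrete, here is a minimal sketch of it as two separate requests driven from a client, not as the single query asked about. The field `@timestamp`, the aggregation names and all helper functions are hypothetical, and the two sample responses at the end are made up for illustration.

```python
def recent_errors_query(window="5m", max_terms=100):
    """Level 1: distinct err_msg values seen inside the recent window."""
    return {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": f"now-{window}"}}},
        "aggs": {"recent": {"terms": {"field": "err_msg", "size": max_terms}}},
    }


def seen_before_query(candidates, window="5m"):
    """Level 2: which of the candidate values also occur before the window.

    Only call this with a non-empty candidate list; with no recent errors
    the second request can be skipped entirely.
    """
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"err_msg": candidates}},
                    {"range": {"@timestamp": {"lt": f"now-{window}"}}},
                ]
            }
        },
        "aggs": {
            "seen_before": {"terms": {"field": "err_msg", "size": len(candidates)}}
        },
    }


def bucket_keys(response, agg_name):
    """Extract the bucket keys of a terms aggregation from a search response."""
    return [bucket["key"] for bucket in response["aggregations"][agg_name]["buckets"]]


def new_errors(recent_response, older_response):
    """Recent err_msg values that have no occurrence before the window."""
    older = set(bucket_keys(older_response, "seen_before"))
    return [key for key in bucket_keys(recent_response, "recent") if key not in older]


# Made-up responses, trimmed to the parts the helpers read.
recent = {"aggregations": {"recent": {"buckets": [
    {"key": "timeout", "doc_count": 3},
    {"key": "disk full", "doc_count": 1},
]}}}
older = {"aggregations": {"seen_before": {"buckets": [
    {"key": "timeout", "doc_count": 42},
]}}}
print(new_errors(recent, older))  # → ['disk full']
```

Because `now` is resolved separately for each request, the two windows can differ by a few milliseconds; passing one fixed timestamp to both requests would avoid that. The second request still has to look at the older indices, but only for the handful of candidate values and only when the first request found anything.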