Significant Terms aggregation around some specific event

(Nikhil Utane) #1


I am doing log analytics to debug an issue where a few of our devices are rebooting at random. I have an uptime parameter that tells me when a device rebooted. I want to find any interesting events that occur before a reboot. I suppose I can use the "Significant Terms Aggregation" here. For that, I would like to create a foreground set that filters events to "all log messages X minutes prior to the reboot event". I can then apply Significant Terms to the actual log message. Is this the right approach? If so, how can I write the above query?


(Mark Harwood) #2

If you're looking at a text field then you can use the new significant_text aggregation in version 6.
Unlike the significant_terms agg on free text, it doesn't rely on loading all documents' values into RAM. An example:

GET tweets/_search
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "20171124 16:50:00",
        "lte": "20171124 17:15:00"
      }
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "sigText": {
          "significant_text": {
            "field": "text",
            "shard_min_doc_count": 3,
            "min_doc_count": 3,
            "size": 20,
            "shard_size": 200
          }
        }
      }
    }
  }
}

I find shingles of size 2 can work well on text.
Having many shards makes it hard to find low-frequency terms. If you're looking for something that occurs maybe only 3 times and you have 5 shards, then the chances are it occurs no more than once on any one shard. That means you'd have to lower shard_min_doc_count to 1 so that every term is considered, and increase shard_size so that large numbers of terms are shipped back to the coordinating node for consideration. Both changes cost time and memory.
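
To put a rough number on the 3-occurrences, 5-shards case, here is a small sketch in Python. It assumes each occurrence lands on a shard uniformly and independently, which is only an approximation of real document routing, and the function name is made up for illustration. It enumerates every placement and counts those where at least one shard sees enough occurrences to pass shard_min_doc_count.

```python
from fractions import Fraction
from itertools import product


def shard_survival_probability(occurrences, shards, shard_min_doc_count):
    """Probability that at least one shard holds >= shard_min_doc_count
    of the occurrences.

    Assumption (not from the thread): each occurrence is routed to a
    shard uniformly at random, independently of the others.
    """
    hits = 0
    total = 0
    # Enumerate every way of assigning each occurrence to a shard.
    for placement in product(range(shards), repeat=occurrences):
        total += 1
        counts = [placement.count(s) for s in range(shards)]
        if max(counts) >= shard_min_doc_count:
            hits += 1
    return Fraction(hits, total)


# Threshold 3: all three occurrences must share one shard.
print(shard_survival_probability(3, 5, 3))  # 1/25
# Threshold 1: any shard holding the term reports it.
print(shard_survival_probability(3, 5, 1))  # 1
```

Under that assumption, a term occurring 3 times would clear a per-shard threshold of 3 only about 4% of the time, which is why the threshold has to drop to 1 for such rare terms.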

(Nikhil Utane) #3

Thanks Mark for the quick response.

You have touched upon quite a few concepts which are new to me so I'll go back and understand those in greater detail.

In your example you have specified a hard-coded time interval. So I suppose I can always run a query, get the timestamps corresponding to the reboot events, and then generate the above query for each of those timestamps.

Is it possible to accomplish all this within the same query?
For example:

  1. Run query A on index X and get all timestamps T when some problem occurred.
  2. Run query B on index Y and get all documents from the 5 minutes prior to each timestamp T.
  3. Run a Significant Text or Terms query on the above output, where the background set is the full index Y and the foreground set is the filtered output.
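
The glue between steps 1 and 2 could look something like the Python sketch below. It assumes the timestamps T from query A are already available as `datetime` objects; the function name and the sample timestamps are made up for illustration. It turns each timestamp into a "5 minutes prior" window and merges windows that overlap.

```python
from datetime import datetime, timedelta


def windows_before(event_times, minutes=5):
    """Return merged (start, end) windows covering `minutes` before each event."""
    span = timedelta(minutes=minutes)
    windows = sorted((t - span, t) for t in event_times)
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical reboot timestamps, standing in for the result of query A.
reboots = [
    datetime(2017, 11, 24, 17, 15),
    datetime(2017, 11, 24, 17, 18),
    datetime(2017, 11, 25, 9, 0),
]
for start, end in windows_before(reboots):
    print(start.isoformat(), "->", end.isoformat())
```

The first two reboots are 3 minutes apart, so their windows collapse into one; each resulting window would then become one time range in query B.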

Sounds complicated but I'd expect this should be a fairly common ask, isn't it?


(Mark Harwood) #4

No, not in a single query. The steps are as you outlined, except that steps 2 and 3 can be combined into one query.
In my example I had one "event window", but you will have multiple windows, so you'd need to replace my single range query with a bool query whose should array acts as a container for multiple range queries. This logically ORs the clauses: range1 OR range2 OR ...
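
As a rough sketch of that shape, the Python below builds the request body from a list of reboot timestamps. The field names `date` and `text` and the aggregation settings are carried over from my earlier example. The function name, the sample timestamps, and the ISO 8601 date strings are assumptions, so adjust the date format to whatever your mapping accepts.

```python
import json
from datetime import datetime, timedelta


def build_query(reboot_times, minutes_before=5, date_field="date", text_field="text"):
    """Build a search body whose foreground set is the union of all pre-reboot windows."""
    span = timedelta(minutes=minutes_before)
    # One range clause per reboot; the bool/should array ORs them together.
    should = [
        {"range": {date_field: {"gte": (t - span).isoformat(), "lte": t.isoformat()}}}
        for t in reboot_times
    ]
    return {
        "size": 0,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": 10000},
                "aggs": {
                    "sigText": {
                        "significant_text": {
                            "field": text_field,
                            "shard_min_doc_count": 3,
                            "min_doc_count": 3,
                            "size": 20,
                            "shard_size": 200,
                        }
                    }
                },
            }
        },
    }


# Hypothetical reboot timestamps, standing in for the output of step 1.
reboots = [datetime(2017, 11, 24, 17, 15), datetime(2017, 11, 25, 9, 0)]
body = build_query(reboots)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request on the log index. `minimum_should_match` is set explicitly so that the OR behaviour is preserved even if a must or filter clause is added to the bool query later.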

(Nikhil Utane) #5

Thank You. Will try it out.
