How do I aggregate across subsearches?

A certain periodic task is supposed to happen once a minute. I send an alert when this is hanging using the following condition in a Threshold Alert.

WHEN count() GROUPED OVER top 1 'myPeriodicTaskLog' 
IS BELOW 1 FOR THE LAST 2 minutes

My complication: This task is occurring separately in multiple Docker Instances, and I want to check that none of them is blocked.

I want to say "The key myPeriodicTaskLog must occur each minute in each instance. Otherwise send an alert."

I have the field instance_name. Each instance's name is assigned pseudorandomly on each deployment (i.e., something like "a58hgh12g2"). So, I cannot code the condition to include these names as literals but can use these values to aggregate.

How do I group in this way?

Hi @Joshua_Fox,

Do you happen to know the exact number of instances available? Or is that something that can change over time?

It is usually two but we might scale up to three.

If a solution depends on having a constant number, that could be good enough.

Check out this watch. It will ensure none of the nodes in your cluster are down (but it assumes you know the node count). This might work for you:

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".monitoring-es-6*"
        ],
        "types": [],
        "body": {
          "size": 1,
          "sort": [
            {
              "timestamp": {
                "order": "desc"
              }
            }
          ],
          "_source": "cluster_stats.nodes.count.total",
          "query": {
            "bool": {
              "filter": {
                "term": {
                  "type": "cluster_stats"
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.hits.0._source.cluster_stats.nodes.count.total": {
        "lt": 3
      }
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "level": "error",
        "text": "Node is down! We only detect {{ctx.payload.hits.hits.0._source.cluster_stats.nodes.count.total}} nodes."
      }
    },
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "steve@steve.com"
        ],
        "subject": "Steve! A node in your cluster is down!",
        "body": {
          "text": "We only detect {{ctx.payload.hits.hits.0._source.cluster_stats.nodes.count.total}} nodes."
        }
      }
    }
  }
}

Thank you. That query does not mention myPeriodicTaskLog, so it seems that it tracks node liveness but not the liveness of that thread on each node. Is that right?

I'd like to do the following:

  1. node_list = query for a list of nodes that have been live at all in the last minute, based on instance_name field which occurs in each log line
  2. query for the presence of myPeriodicTaskLog in the last minute, grouped by node for each node in node_list. If that query does not return at least 1 value for each node, send an alert.

Is this feasible?

It should be.

The following watch will look at packetbeat data by grouping all data by ip address, then using that grouping, determine if any of the documents in each ip bucket are missing a response code. If so, it will fire an alert. This feels similar to what you're doing so hopefully this will help. One thing I recommend is writing an ES query that will actually detect the data you are hoping to use in the condition. If you can do that, you can definitely create a watch for it.

{
  "trigger": {
    "schedule": {
      "interval": "10s"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "packetbeat-*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "match_all": {}
          },
          "aggs": {
            "unique_beat_names": {
              "terms": {
                "field": "ip",
                "size": 5
              },
              "aggs": {
                "response_code": {
                  "filter": {
                    "exists": {
                      "field": "dns.response_code"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "ctx.vars.missing=false;for (def beat_name : ctx.payload.aggregations.unique_beat_names.buckets){if(beat_name.doc_count == 0 || beat_name.response_code.doc_count == 0){ctx.vars.missing=true;}}return ctx.vars.missing;",
      "lang": "painless"
    }
  },
  "actions": {
    "my-logging-action": {
      "logging": {
        "level": "info",
        "text": "Oh yea"
      }
    }
  }
}

Thank you. It looks like ctx.payload.aggregations.unique_beat_names.buckets gives the list of unique IPs (in our case, that will be Nodes/instances by instance_name rather than IP.)

Then this script gives the boolean for the alert. Importantly, it looks like this is written in the Painless scripting language, which is rich enough to encode any needed logic:

ctx.vars.missing = false;
for (def beat_name : ctx.payload.aggregations.unique_beat_names.buckets) {
    if (beat_name.doc_count == 0 || beat_name.response_code.doc_count == 0) {
      ctx.vars.missing=true;
    }
}
return ctx.vars.missing;

Yup! Let me know if that helps!
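
In case it is useful, here is a rough sketch of how the packetbeat watch above might be adapted to your case. It is untested against your data and rests on several assumptions: the index pattern my-app-logs-* is a placeholder, the time field is assumed to be @timestamp, instance_name is assumed to be mapped as a keyword so it can be used in a terms aggregation, and myPeriodicTaskLog is assumed to be a field whose presence marks a task run. The 2 minute window mirrors your original threshold alert, and the bucket size of 10 is just a guess that comfortably covers two or three instances.

```json
{
  "metadata": {
    "note": "Sketch only. The index pattern, the @timestamp field, the 2m window and the bucket size are assumptions."
  },
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "my-app-logs-*"
        ],
        "body": {
          "size": 0,
          "query": {
            "range": {
              "@timestamp": {
                "gte": "now-2m"
              }
            }
          },
          "aggs": {
            "instances": {
              "terms": {
                "field": "instance_name",
                "size": 10
              },
              "aggs": {
                "task_log": {
                  "filter": {
                    "exists": {
                      "field": "myPeriodicTaskLog"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "lang": "painless",
      "source": "for (def instance : ctx.payload.aggregations.instances.buckets) { if (instance.task_log.doc_count == 0) { return true; } } return false;"
    }
  },
  "actions": {
    "log_missing_task": {
      "logging": {
        "level": "error",
        "text": "At least one instance has not logged myPeriodicTaskLog in the last 2 minutes."
      }
    }
  }
}
```

The terms aggregation plays the role of your node_list step, and the filter sub-aggregation counts the task log lines per instance. One limitation to keep in mind: an instance that has logged nothing at all in the window gets no bucket, so this only catches instances that are still logging other lines while the periodic task is stuck.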
