ES query to find the earliest occurrence of message for each host

I am experienced with ELK but not so much with directly querying Elasticsearch and I need help to construct an Elasticsearch query. We are currently using Elasticsearch 1.4.4 and not using Watcher.

The data contains error messages from logs on multiple hosts. Quite often when an error occurs a message is written out many times until the issue is resolved. This means that there are identical documents from a particular host with only the timestamp changing.

I need the query to do the following:

  1. Search for a a particular phrase, "Assertion failure" within the time frame now-2h.
  2. For each host which has documents with this phrase only return the earliest result.
  3. Data returned for each of those hits should include all the data in that document. So we can't just return a list of hosts where this phrase has occurred, we need all the data since the exact error message will differ from host to host and we need that.

It's part 2 which is tripping me up. I don't know how, or if it's even possible, to make a query with that function. If there are features of a newer Elasticsearch version which can do this then we are able to update.

The reason why part 2 is important is because there could be a thousand or so of these messages but they only actually occur on a couple hosts.

Here is some simplified example sudo-data in Elasticsearch:

13:21 host1 Assertion failure 22334
13:22 host1 Assertion failure 22334
13:23 host1 Assertion failure 22334
13:45 host2 Assertion failure 111133
13:46 host2 Assertion failure 111133

And here is which documents should be returned after the query on that data:

13:21 host1 Assertion failure 22334
13:45 host2 Assertion failure 111133

Thank you for any help.

You could do this using a combination of the terms and top_hits aggregations and do something like the following:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match": {
          "message": "Assertion failure"
        }
      },
      "filter": {
        "range": {
          "FIELD": {
            "gte": "now-2h"
          }
        }
      }
    }
  },
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host",
        "size": 10
      },
      "aggs": {
        "earliest_record": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "date": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Here we are using the query to subset the documents to the ones you are interested in (those mentioning "Assertion failure" in the last 2 hours). Note that for this I have assumed you are searching in a field called message for the string. I have also set size to 0 so that no hits are returned since the information you want will be in the aggregations section.

Then we use the terms aggregation to bucket the matching documents by host. Note that as written this will return the top 10 hosts (ordered by descending document count). You could order the hosts by a nested min aggregation on the date field so you get the top 10 hosts who logged this message first. You could also change the size to include more hosts (say 100) if you expect more hosts to report the error but I would not recommend setting size to 0 as this can easily lead to memory problems if the number of hosts gets very high.

Then each host bucket we add a top_hits aggregation to give us the top document ordered by ascending date (I have assumed your date field is called date here).

The response will be more complex than just an array of hits but it should be easy enough to extract the information you need. If you are on 1.6 or later you could also use the response filter feature so the response only contains the bits that you actually want to use.

Hope that helps

Thanks Colin, I believe we're well on the way to getting this to work! Everything you said makes sense.

There are a couple issues. First, I'll post the query I'm currently using.

{
    "fields": [
        "@timestamp",
        "host"
    ],
    "size": 0,
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "message": "\"Assertion failure\""
                }
            },
            "filter": {
                "range": {
                    "@timestamp": {
                        "gte": "now-2h"
                    }
                }
            }
        }
    },
    "aggs": {
        "hosts": {
            "terms": {
                "field": "host",
                "size": 10
            },
            "aggs": {
                "earliest_record": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "@timestamp": {
                                    "order": "asc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

At the moment the query is returning results for "Assertion", "failure and "Assertion failure". I put the phrase in escaped quotes (as seen above) but that still doesn't seem to search for the exact phrase.

How do I select which fields to be returned in the aggregation? As seen above I'd like to only return host names and the timestamp, but this doesn't seem to affect the data returned.

The messages I'm looking for actually come in the form:

*Assertion failure*invalid connection*

Where the * can be any number of characters. I'm not sure of the best way to do that. Do I stick in a bool query with two matches for "Assertion failure" and "invalid connection" as musts? Or do I use a single match and use a wildcard? Or something else?

Even though you might think this is simple stuff your help has been invaluable to my learning. Thanks again!
Alex

1 Like

I think you are looking for phrase matching, have a look at this link for how to do that in your match query: Match query | Elasticsearch Guide [8.11] | Elastic

You can filter the fields returned in the _source of each hit with the Source Filtering feature in the same way as this can be used in the search hits. So your top_hits aggregation might then look something like the following:

"earliest_record": {
  "top_hits": {
    "size": 1,
    "sort": [
      {
        "@timestamp": {
          "order": "asc"
        }
      }
    ],
    "_source": {
      "include": "host"
    }
  }
}

Yep, I would do this in a bool query with two must clauses.

Happy to help :smile: