How to search for a first occurrence of a term

I'd like to search the last 5 minutes for values in the err_msg field that occurred for the first time ever, and repeat this search every 5 minutes, so it should be as efficient as possible.

I wonder how to express this in a single query.

So far I've been doing:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg",
        "size": 10
      }
    }
  }
}

followed by one query per err_msg in the response, keeping only the err_msg values with no earlier hits:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "match": {
            "err_msg": err_msg
          }
        },
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        }
      ]
    }
  }
}
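In client code, that loop boils down to something like this (a minimal Python sketch; `count_earlier_hits` is a hypothetical wrapper that runs the second query for one err_msg and returns its hit count):

```python
def keep_first_time_errors(recent_buckets, count_earlier_hits):
    """Filter the terms-aggregation buckets from the first query, keeping
    only the err_msg values with zero hits before the cutoff.

    recent_buckets: list of {"key": <err_msg>, "doc_count": ...} buckets.
    count_earlier_hits: callable(err_msg) -> number of matching docs
    before the cutoff (one search per err_msg, as in the query above).
    """
    return [
        bucket["key"]
        for bucket in recent_buckets
        if count_earlier_hits(bucket["key"]) == 0
    ]


# Hypothetical sample data in place of real responses:
buckets = [{"key": "disk full", "doc_count": 3}, {"key": "timeout", "doc_count": 1}]
seen_before = {"timeout": 12}  # err_msg -> earlier hit count
new_msgs = keep_first_time_errors(buckets, lambda m: seen_before.get(m, 0))
# new_msgs == ["disk full"]
```

The obvious cost is N+1 round trips per 5-minute window, which is what the rest of the thread tries to avoid.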

It feels like this could be done in one query, which is why I'm asking for a bit of help.

I don't think it has to be an aggregation; a search could work, but I don't know how. Maybe as a scripted search?

In a cluster with time-based indices and lots of potential error types this will be hard. A “new” index will not have visibility of the content in old indices, and vice versa.

err_msg is a keyword, and it is only the first 160 chars of the original error message (.slice(0, 160)). After running a stack for more than a month, I got fewer than 20 different err_msg values with this query:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-300d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg"
      }
    }
  }
}

In that case something like this might work. This finds the first use of each tag on Stack Overflow (note: there are thousands of tags, so I limit them in this example using the include parameter):

GET so/_search
{
  "size": 0,
  "aggs": {
    "tag": {
      "terms": {
        "field": "tag",
        "include": [
          "logstash",
          "java",
          "kibana"
        ],
        "order": {
          "firstSeen": "asc"
        }
      },
      "aggs": {
        "firstSeen": {
          "min": {
            "field": "creationDate"
          }
        }
      }
    }
  }
}

Your client would still have to filter out the terms whose firstSeen is more than 5 minutes old, but the bulk of the heavy lifting is done in this request.
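Client-side, that post-filter could look like this (a sketch, assuming the firstSeen min sub-aggregation returns epoch milliseconds in its "value" field, which is what a min aggregation on a date field yields):

```python
from datetime import datetime, timedelta, timezone


def terms_first_seen_within(buckets, minutes=5, now=None):
    """From terms buckets carrying a firstSeen min sub-aggregation,
    keep only the keys whose earliest timestamp falls inside the last
    `minutes`. Everything older is an already-known term."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=minutes)
    return [
        bucket["key"]
        for bucket in buckets
        if datetime.fromtimestamp(bucket["firstSeen"]["value"] / 1000, tz=timezone.utc)
        >= cutoff
    ]


# Hypothetical buckets, timestamps built for readability:
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ms = lambda dt: dt.timestamp() * 1000
buckets = [
    {"key": "java", "firstSeen": {"value": ms(datetime(2023, 6, 1, tzinfo=timezone.utc))}},
    {"key": "logstash", "firstSeen": {"value": ms(datetime(2024, 1, 1, 11, 58, tzinfo=timezone.utc))}},
]
terms_first_seen_within(buckets, minutes=5, now=now)  # -> ["logstash"]
```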


Thanks, your suggestion works:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg",
        "order": {
          "firstSeen": "asc"
        }
      },
      "aggs": {
        "firstSeen": {
          "min": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}

I was still wondering if we could instead have a "2-level" query, like what I posted originally, but written as one request: the first level queries the very recent errors from the last 5 minutes, and the second level checks whether each of those errors has an earlier match, before now-5m. That approach seems more scalable to me, since most of the time there are no errors in the last 5 minutes, and even the second-level search can be efficient.
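For what it's worth, that two-level idea can be approximated client-side in two requests instead of N+1, by putting a terms filter on the candidate err_msg values in the second request (a sketch; the `search` parameter is a hypothetical stand-in for a real client call such as es.search on filebeat*):

```python
def new_errors_last_5m(search):
    """Two-step check: (1) collect err_msg values seen in the last 5
    minutes; (2) one terms-filtered query tells us which of those also
    occur before now-5m. Whatever is left is genuinely new.

    `search` is a callable taking a request body and returning the parsed
    response; in production it would wrap the Elasticsearch client.
    """
    recent = search({
        "size": 0,
        "query": {"bool": {"filter": [
            {"match": {"stream": "stderr"}},
            {"range": {"@timestamp": {"gte": "now-5m"}}},
        ]}},
        "aggs": {"errors": {"terms": {"field": "err_msg", "size": 100}}},
    })
    candidates = [b["key"] for b in recent["aggregations"]["errors"]["buckets"]]
    if not candidates:  # the common case: no recent errors, stop after one request
        return []
    older = search({
        "size": 0,
        "query": {"bool": {"filter": [
            {"match": {"stream": "stderr"}},
            {"terms": {"err_msg": candidates}},
            {"range": {"@timestamp": {"lt": "now-5m"}}},
        ]}},
        "aggs": {"errors": {"terms": {"field": "err_msg", "size": 100}}},
    })
    already_seen = {b["key"] for b in older["aggregations"]["errors"]["buckets"]}
    return [msg for msg in candidates if msg not in already_seen]
```

In the common case where the last 5 minutes are quiet, this costs a single cheap request; the second, heavier request only runs when there are candidates to check.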
