How to search for a first occurrence of a term


(Cyril Auburtin) #1

I'd like to search the last 5 minutes for values of the err_msg field that occurred for the first time ever, and repeat this search every 5 minutes, so it should be as efficient as possible.

I wonder how to shape this as one query.

So far I've been doing:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg",
        "size": 10
      }
    }
  }
}

followed by one query per err_msg in the response, keeping only the err_msg values with no hits:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "match": {
            "err_msg": "<err_msg from the previous response>"
          }
        },
        {
          "range": {
            "@timestamp": {
              "lt": "now-1d"
            }
          }
        }
      ]
    }
  }
}
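The loop described above can be sketched roughly as follows. This is a hypothetical client-side helper, not the author's actual code; the `es.count` call assumes a client object with the Python `elasticsearch` library's count API shape, and the function names are made up for illustration.

```python
def prior_hits_query(err_msg, cutoff="now-5m"):
    """Build the follow-up query: same stream filter, the exact err_msg,
    restricted to documents older than the cutoff."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {"stream": "stderr"}},
                    {"match": {"err_msg": err_msg}},
                    {"range": {"@timestamp": {"lt": cutoff}}},
                ]
            }
        }
    }


def first_time_errors(es, recent_err_msgs):
    """Keep only the messages that have no hits before the cutoff,
    i.e. the ones seen for the first time ever."""
    new_msgs = []
    for msg in recent_err_msgs:
        resp = es.count(index="filebeat*", body=prior_hits_query(msg))
        if resp["count"] == 0:  # never seen before now-5m
            new_msgs.append(msg)
    return new_msgs
```

Note this issues N+1 requests per run, which is exactly the inefficiency the question is about.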

It feels like this could be done in one query; that's why I'm asking for a bit of help.

I don't think it has to be an aggregation; a plain search could work, but I don't know how. Maybe a scripted search?


(Mark Harwood) #2

In a cluster with time-based indices and lots of potential error types, this will be hard: a "new" index has no visibility into the content of old indices, and vice versa.


(Cyril Auburtin) #3

err_msg is a keyword, and it holds only the first 160 chars of the original error message (.slice(0, 160)). After running a stack for more than a month, I got fewer than 20 distinct err_msg values with this query:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-300d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg"
      }
    }
  }
}

(Mark Harwood) #4

In which case something like this might work. This finds the first uses of tags on StackOverflow (note there are thousands of tags, so in this example I limit them with the include parameter):

GET so/_search
{
  "size": 0,
  "aggs": {
    "tag": {
      "terms": {
        "field": "tag",
        "include": [
          "logstash",
          "java",
          "kibana"
        ],
        "order": {
          "firstSeen": "asc"
        }
      },
      "aggs": {
        "firstSeen": {
          "min": {
            "field": "creationDate"
          }
        }
      }
    }
  }
}

Your client would still have to filter out the buckets whose firstSeen date is more than 5 minutes old, but the bulk of the heavy lifting is done in this one request.


(Cyril Auburtin) #5

Thanks, your suggestion works:

GET /filebeat*/_search?size=0
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "stream": "stderr"
          }
        }
      ]
    }
  },
  "aggs": {
    "errors": {
      "terms": {
        "field": "err_msg",
        "order": {
          "firstSeen": "asc"
        }
      },
      "aggs": {
        "firstSeen": {
          "min": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}

I was still wondering if we could instead have a "2-level" query, like what I posted originally, but written as one query: the first level queries the very recent errors from the last 5 minutes, and the second level checks for a possible earlier match for those errors, before now-5m. That way seems more scalable to me, since most of the time there are no errors in the last 5 minutes, and even the second-level search can be efficient.


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.