Pipeline aggregations on millions of unique terms

Our index holds log events from 100,000 clients, mixed in a flat timeline. Each doc looks like:

{
  "@timestamp": "...",
  "eventId": "Failed to start process foo.exe",
  "clientId": "f68830d5-b1bf-45d6-b54b-abbf2438b709"
}

I'm trying to write a significant_terms query that answers:

"find unusual eventIds across all clientIds that have a given error".

The missing part of the query below is converting the filter query into the list of all clientId terms that contain the searched error (a two-pass workaround is sketched after the query). Can this be done with pipeline aggregations, given that the filter query can produce thousands of unique clientIds?

Or should I rather maintain another index, where each doc holds all events per clientId?

GET /events/_search
{
  "size" : 0,
  "query" : {
    "terms" : { "eventId.keyword": ["Failed to start process dumper at"]
    }
  },
  "aggregations": {
    "top_unusual_errors": {
      "significant_terms": {
        "field": "eventId.keyword",
        "size" : 10
      }
    }
  }
}
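
For context, the two-pass workaround I'm considering looks like this (a sketch only; it assumes clientId is mapped as a keyword, otherwise clientId.keyword would be needed). The first request harvests the clientIds that saw the error:

GET /events/_search
{
  "size": 0,
  "query": {
    "term": { "eventId.keyword": "Failed to start process foo.exe" }
  },
  "aggregations": {
    "matching_clients": {
      "terms": { "field": "clientId", "size": 10000 }
    }
  }
}

The second request feeds the harvested IDs back in as the foreground set for significant_terms:

GET /events/_search
{
  "size": 0,
  "query": {
    "terms": { "clientId": ["f68830d5-b1bf-45d6-b54b-abbf2438b709", "..."] }
  },
  "aggregations": {
    "top_unusual_errors": {
      "significant_terms": { "field": "eventId.keyword", "size": 10 }
    }
  }
}

This second pass is exactly the part that doesn't scale: the terms aggregation has a size cap and the terms query has a clause limit (index.max_terms_count, 65,536 by default), hence the question about pipeline aggregations.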

Yes, that would be the more scalable approach. There are example scripts and a walk-through here.
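
To make the suggestion concrete, here is a minimal sketch of the entity-centric approach, assuming one document per clientId whose eventIds field accumulates that client's distinct events (the index name clients and the field name eventIds are illustrative, not from the original post):

PUT /clients/_doc/f68830d5-b1bf-45d6-b54b-abbf2438b709
{
  "clientId": "f68830d5-b1bf-45d6-b54b-abbf2438b709",
  "eventIds": [
    "Failed to start process foo.exe",
    "..."
  ]
}

With that shape, the original question becomes a single query, because each matching document already represents one client:

GET /clients/_search
{
  "size": 0,
  "query": {
    "term": { "eventIds.keyword": "Failed to start process foo.exe" }
  },
  "aggregations": {
    "top_unusual_errors": {
      "significant_terms": { "field": "eventIds.keyword", "size": 10 }
    }
  }
}

The foreground set is now "clients having the error" rather than "log lines having the error", which is what the significant_terms statistics need, and there is no intermediate list of thousands of clientIds to carry between requests.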
