Optimizing using the sampler aggregation

I have an index where each doc contains all of a user's weekly events.
Each event has a messages.txt field, which contains whole-line strings, some very long (mapping below).
The following query takes 15 seconds to return, and the "sampler" aggregation does not help.
Is there a way to limit the number of documents that are fed into the aggregation?

GET /users_weekly_events/_search
{
  "size": 0,
  "query": {...},
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 5
      },
      "aggs": {
        "keywords": {
          "terms": {
            "field": "messages.txt.keyword",
            "size": 5
          }
        }
      }
    }
  }
}

Mapping of the text field:

          "messages" : {
            "properties" : {
              "txt" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },

I have a number of questions:

What is the query? How many docs in the index? How many nodes?
Isn't the top_hits aggregation on its own more appropriate than the sampler/terms agg combo you're using here?
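
For example, something along these lines (just a rough sketch; "top_events" is an illustrative name and {...} stands for your query):

// sketch only: plain top_hits returning the 5 best-scoring docs for the query
GET /users_weekly_events/_search
{
  "size": 0,
  "query": {...},
  "aggs": {
    "top_events": {
      "top_hits": {
        "size": 5,
        "_source": ["messages.txt"]
      }
    }
  }
}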

I have around 200k docs across 5 shards.
Each doc contains an array with up to 10k strings (the user's weekly events) plus some stats, about 2 MB in total. The @timestamp is the beginning of the weekly time window.
The query is different for every search; the query alone performs very fast.
So you suggest that I should use top_hits to return the newest docs using the @timestamp field?

GET /users_weekly_events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [ 
           {"term": {"osVersion": 10},  {"term": {"foo": "bar"} }
      ]
    }
  },
  "aggs": {
    "event_buckets": {
      "terms": {
        "field": "messages.txt.keyword",
        "include": "some-prefix-of-the-event.*",
        "size": 5
      }
    }
  }
}

Could you please show how to use the top_hits agg as a parent agg for the "event_buckets" agg in my example above?
My documents contain a @timestamp field; I could sort by that, but ideally I would just pick a random sample.

Thanks for the additional info.

What I don't like about large text fields as keywords is the index overhead and the arbitrary loss of data for strings exceeding your ignore_above setting.
It's hard to know what solution to suggest without a full grasp of the business question you're trying to answer.

I agree that long strings are an issue. I think top_hits may help, but I need to clarify the syntax.
My use case is as follows:
Event lines from log files are bucketed using their common string prefix (the event ID).
For each user (_doc) I'm storing an array of these event IDs, and an array of the full events.
The _doc contains all of the user's events during the past week.
Event IDs are used for significant_terms aggs ("find unusual events for a subset of all users"), which is working very well.
What I'm trying to achieve is, for a given event ID, return the top 5 occurrences of the full message.

So for event ID "CURL Failed with err code:"

I would get: ["CURL Failed with err code:404", "CURL Failed with err code:123"...]
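
For context, the significant_terms side that already works looks roughly like this (a sketch; the "eventIDs" field name and the term query are assumptions):

// sketch: "eventIDs" is an assumed name for the per-user array of event IDs
GET /users_weekly_events/_search
{
  "size": 0,
  "query": {
    "term": { "eventIDs": "CURL Failed with err code:" }
  },
  "aggs": {
    "unusual_events": {
      "significant_terms": {
        "field": "eventIDs",
        "size": 10
      }
    }
  }
}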

The following query does the job, except that it takes too long.
Is it possible to use top_hits to limit the number of docs that are fed into the second agg?

GET /users_weekly_events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [ 
           {"term": {"osVersion": 10},  {"term": {"foo": "bar"} }
      ]
    }
  },
  "aggs": {
    "event_buckets": {
      "terms": {
        "field": "messages.txt.keyword",
        "include": "some-prefix-of-the-event.*",
        "size": 5
      }
    }
  }
}

I expect a more useful strategy might be to avoid aggregations based on keyword fields with large strings and instead use hashed versions of those strings. Obviously users will not be able to understand these values, so you'd have to issue a second query to get the related full text, but it does mean you'd be dealing with shorter strings.
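
A sketch of that idea, assuming you add a messages.hash keyword field and populate it at index time with a hash of each full message:

// hypothetical mapping: "messages.hash" holds a short hash of the full message
PUT /users_weekly_events_v2
{
  "mappings": {
    "properties": {
      "messages": {
        "properties": {
          "txt": { "type": "text" },
          "hash": { "type": "keyword" }
        }
      }
    }
  }
}

// then aggregate on "messages.hash" and resolve a winning hash back to its text
GET /users_weekly_events_v2/_search
{
  "size": 1,
  "query": { "term": { "messages.hash": "a1b2c3" } },
  "_source": ["messages.txt"]
}

Since messages is an array inside each user doc, the lookup returns whole docs, so you'd pick out the matching message client-side (or keep the full strings in a separate index, one message per doc).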

Our current implementation is working very well; we now only wish to tune this feature. Is it possible to use the output of top_hits as the input to the next aggregation in the pipeline?

And yes, I am considering using hashes and storing the long strings in another index, but in our use case it's not trivial:
200k _docs (users), each with ~1k event IDs and ~10k distinct messages.
The event ID is a keyword and is always the prefix of the full message.

Event IDs are used to query: find unusual event IDs for users having eventId=X.
Messages are used for match_phrase queries, and messages.txt.keyword is used to aggregate distinct messages.
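
For example, the match_phrase lookups run against the analyzed text field, roughly like this:

// sketch of one of our phrase lookups against the analyzed field
GET /users_weekly_events/_search
{
  "query": {
    "match_phrase": { "messages.txt": "CURL Failed with err code" }
  }
}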

I tried nested aggs, but we hit the 10000 nested objects limit... it was also very, very slow.
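
(I believe the limit we hit is the index.mapping.nested_objects.limit index setting, which defaults to 10000 and can in principle be raised, e.g.:)

// sketch only: raises the per-doc nested-object cap (default 10000)
PUT /users_weekly_events/_settings
{
  "index.mapping.nested_objects.limit": 20000
}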

No. top_hits is used as a leaf aggregation, generally to give more detail on the parent buckets discovered.
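
For example, attached under your existing terms agg it would look something like this (a sketch that returns the newest matching doc per bucket, using your @timestamp field):

// sketch: top_hits as a leaf under your "event_buckets" terms agg
GET /users_weekly_events/_search
{
  "size": 0,
  "aggs": {
    "event_buckets": {
      "terms": {
        "field": "messages.txt.keyword",
        "include": "some-prefix-of-the-event.*",
        "size": 5
      },
      "aggs": {
        "newest": {
          "top_hits": {
            "size": 1,
            "sort": [{ "@timestamp": { "order": "desc" } }],
            "_source": ["messages.txt"]
          }
        }
      }
    }
  }
}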
