How to sort aggregated based on sub aggregation's doc_count?

Hi,

I'm completely stuck trying to figure out how to sort "users" based on "status_codes" in the following query:

{
  
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "status_codes": {
          "filters": {
            "filters": {
              "200s": {
                "range": {
                  "status_code": {
                    "gte": 200,
                    "lte": 299
                  }
                }
              },
              "400s": {
                "range": {
                  "status_code": {
                    "gte": 400,
                    "lte": 499
                  }
                }
              },
              "500s": {
                "range": {
                  "status_code": {
                    "gte": 500,
                    "lte": 599
                  }
                }
              }
            }
          }
        },
        "bucket_truncate": {
          "bucket_sort": {
            "sort": [
           
                ],
           
            "from": 0,
            "size": 10
          }
        }
      }
    }
  }
}

I want "users", that have the most "400s", for example, to come first.

It's probably easier to flip things the other way - ie status codes at the top level aggregation and users as the aggregation underneath.

Will results answer the same question: give me the next 100 users starting from 400th user with their status_codes ordered by 200s/400s/500s ?

When you add deep pagination as a requirement and you’re talking about a distributed system that becomes a problem.

You may me better suited creating an entity-centric index from the log data to do user behaviour analysis. This can be done using transforms.

Understood! Can you please give an example of the query if we throw pagination out of the equation?

This is a non-trivial thing and I'm willing to learn how it's done. Maybe it will help me understand Elastic better.

what puzzles me out is the idea of flipping an aggregation. it seems counter-intuitive to me and I would love to understand why it's the advisable way to go.

It’s about working more easily with the natural sort order of ‘ terms’ aggs. They pick the most popular. With users at the top level you’ll find the most prolific user first and for each the most common status code for that user. With the status codes up top up you’ll get the most common code first (likely 200) and for each status code the most common user. So for 404s you’ll naturally get the users with the most 404s
This might highlight a user who has a lot of 404s but they also have many more 200s to make up for it (ie 404s are only 1% of their total traffic). To find users who have an unusually high mix of a status code (eg 90% of their traffic is 404s) simply use a ‘significant_terms’ agg instead of a ‘terms’ agg for the users.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.