Pipelined Histogram Aggregation?


(Buck Evan) #1

Our raw data looks like this:

{"name": "job1", "status":"running"}
{"name": "job1", "status":"failed"}
{"name": "job1", "status":"running"}
{"name": "job1", "status":"passed"}
{"name": "job2", "status":"running"}
{"name": "job2", "status":"passed"}
{"name": "job3", "status":"running"}

From this, we'd like to produce a summary like the following (a job that has both failed and passed is considered passed):

{
    "running": {"count": 1, "example": "job3"},
    "passed": {"count": 2, "example": "job2"}
}

This is quite difficult (for us). We can write an aggregation that computes the "final" status of each job, but we haven't figured out how to compute the histogram on top of it with a pipeline aggregation.

Is this possible?
Can we get the same results in a simpler way?
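
To pin down the semantics being asked for, here is a minimal client-side sketch in Python, not an Elasticsearch query. It assumes a status precedence of running < failed < passed; only "passed beats failed" comes from the post, and the rest of the ordering, along with the names `STATUS_RANK` and `summarize`, is hypothetical.

```python
# Assumed precedence: the highest-ranked status seen for a job is its final status.
# Only "passed beats failed" is stated in the post; the full ordering is an assumption.
STATUS_RANK = {"running": 0, "failed": 1, "passed": 2}

EVENTS = [
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "failed"},
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "passed"},
    {"name": "job2", "status": "running"},
    {"name": "job2", "status": "passed"},
    {"name": "job3", "status": "running"},
]


def summarize(events):
    """Return {status: {"count": n, "example": job_name}} over final job statuses."""
    # Step 1 (the "first-order" aggregation): one final status per job.
    final = {}
    for event in events:
        name, status = event["name"], event["status"]
        current = final.get(name)
        if current is None or STATUS_RANK[status] > STATUS_RANK[current]:
            final[name] = status

    # Step 2 (the histogram): count jobs per final status and keep one example.
    summary = {}
    for name, status in final.items():
        bucket = summary.setdefault(status, {"count": 0, "example": name})
        bucket["count"] += 1
        bucket["example"] = name  # any job in the bucket would do; keep the last seen
    return summary


print(summarize(EVENTS))
```

The two steps mirror the question: the first loop is the per-job aggregation that already works, and the second loop is the histogram over its output that is hard to express in a single query.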


(Mike Simos) #2

Hi,

Would you need to use a pipeline aggregation? You could do something like:

{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "name",
        "size": 10
      },
      "aggs": {
        "completed": {
          "filter": {
            "terms": {
              "status": [
                "failed",
                "passed"
              ]
            }
          }
        },
        "running": {
          "filter": {
            "term": {
              "status": "running"
            }
          }
        }
      }
    }
  }
}

Which results in a response like:

  "aggregations": {
    "jobs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "job1",
          "doc_count": 4,
          "running": {
            "doc_count": 2
          },
          "completed": {
            "doc_count": 2
          }
        },
        {
          "key": "job2",
          "doc_count": 2,
          "running": {
            "doc_count": 1
          },
          "completed": {
            "doc_count": 1
          }
        },
        {
          "key": "job3",
          "doc_count": 1,
          "running": {
            "doc_count": 1
          },
          "completed": {
            "doc_count": 0
          }
        }
      ]
    }
  }

(Buck Evan) #3

Yes, this is the first-order aggregation that I mentioned. From this, you can see that there are three jobs, two of which have completed, one of which is running. Imagine that the number of jobs is very large (greater than 50k), and I want a summary like:

{
    "running": 1,
    "passed": 2
}

This summary would need to be built on top of your first-order aggregation in order to avoid any double-counting, and to have correct status for each job. Further, I'd like to return one key from each bucket in this summary:

{
    "counts": {"running": 1, "passed": 2},
    "examples": {"running": "job3", "passed": "job2"}
}
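
One way to build such a summary today is to reduce the buckets from post #2 on the client. The sketch below is Python and makes two assumptions: the function name `summarize_buckets` is hypothetical, and a job counts as "completed" as soon as its `completed` filter matched at least one document. Because that filter lumps "failed" and "passed" together, this version can only report "completed" rather than "passed". Note also that the `terms` aggregation returns only the top `size` buckets, so with more than 50k jobs every bucket would first have to be retrieved, for example by paging.

```python
# Buckets copied from the response shown in post #2.
BUCKETS = [
    {"key": "job1", "doc_count": 4,
     "running": {"doc_count": 2}, "completed": {"doc_count": 2}},
    {"key": "job2", "doc_count": 2,
     "running": {"doc_count": 1}, "completed": {"doc_count": 1}},
    {"key": "job3", "doc_count": 1,
     "running": {"doc_count": 1}, "completed": {"doc_count": 0}},
]


def summarize_buckets(buckets):
    """Collapse per-job buckets into per-status counts plus one example job each."""
    counts = {}
    examples = {}
    for bucket in buckets:
        # Assumption: any completed event means the job is no longer running.
        status = "completed" if bucket["completed"]["doc_count"] > 0 else "running"
        counts[status] = counts.get(status, 0) + 1
        examples.setdefault(status, bucket["key"])  # keep the first job seen
    return {"counts": counts, "examples": examples}


print(summarize_buckets(BUCKETS))
```

Each job contributes to exactly one status, which avoids the double-counting mentioned above, but the work happens outside Elasticsearch and grows with the number of jobs.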

(Christian Dahlqvist) #5

Doing this entirely at query time can get expensive at scale, as you may need to go through a lot of data. If you have queries that you always want to run against the last/current status, you can create a separate index that holds one document per ID (use the ID as the key). Whenever you get a status update, you write the event into the existing indices as you currently do, and at the same time you update the document belonging to the job in the new index with the current state. This gives you a much smaller index that holds just the last state and can be queried more efficiently, even at large scale. This is basically a very simple entity-centric index.
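
The dual-write pattern described here can be sketched in plain Python, with a list standing in for the existing event indices and a dict keyed by job name standing in for the entity-centric index. The names `apply_event` and `status_histogram` are hypothetical, and the sketch assumes events arrive in order so that the latest write is the current state.

```python
def apply_event(event_log, job_index, event):
    """Write the raw event and overwrite the job's current-state document."""
    event_log.append(event)  # existing behaviour: keep every event
    # Entity-centric index: one document per job, keyed by job name.
    job_index[event["name"]] = {"name": event["name"], "status": event["status"]}


def status_histogram(job_index):
    """Count jobs per current status; this needs only the small per-job index."""
    counts = {}
    for doc in job_index.values():
        counts[doc["status"]] = counts.get(doc["status"], 0) + 1
    return counts


event_log, job_index = [], {}
for name, status in [
    ("job1", "running"), ("job1", "failed"), ("job1", "running"), ("job1", "passed"),
    ("job2", "running"), ("job2", "passed"),
    ("job3", "running"),
]:
    apply_event(event_log, job_index, {"name": name, "status": status})

print(status_histogram(job_index))
```

With one document per job, the histogram becomes an ordinary single-level count over the small index, at the cost of maintaining that index on every write.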


(Buck Evan) #6

Thanks Christian. That's what we're working on implementing right now, as a workaround for not being able to express the above query.

The drawback, which I'd like to eliminate, is that we need to keep one index per "first order" aggregation that we want to support. This means we can only support use cases that we've thought of well before the user wants them.

If I can do this without pre-aggregation, I can instead support arbitrary use cases.
