Pipelined Histogram Aggregation?

Buck_Evan · October 24, 2016, 5:39pm

Our raw data looks like so:

{"name": "job1", "status":"running"}
{"name": "job1", "status":"failed"}
{"name": "job1", "status":"running"}
{"name": "job1", "status":"passed"}
{"name": "job2", "status":"running"}
{"name": "job2", "status":"passed"}
{"name": "job3", "status":"running"}

From this, we'd like to produce the following summary (a job that has both failed and passed is considered passed):

{
    "running": {"count": 1, "example": "job3"},
    "passed": {"count": 2, "example": "job2"}
}

This is quite difficult (for us). We can write an aggregation to compute the "final" status of each job, but we haven't figured out how to compute a histogram of those final statuses with a pipeline aggregation.

Is this possible?
Can we get the same results in a simpler way?
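
To pin down the semantics being asked for, here is a minimal sketch in plain Python (not an Elasticsearch query) of the two steps: reduce each job to one "final" status, then build the histogram. The precedence table is an assumption: the post only states that passed beats failed, so ranking failed above running is a guess. The names `PRECEDENCE`, `final_statuses` and `summarize` are made up for illustration, and the example job per status is simply the first one encountered.

```python
# Assumed precedence: the higher number wins when a job has several statuses.
# Only "passed beats failed" comes from the post; the rest is an assumption.
PRECEDENCE = {"running": 0, "failed": 1, "passed": 2}

events = [
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "failed"},
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "passed"},
    {"name": "job2", "status": "running"},
    {"name": "job2", "status": "passed"},
    {"name": "job3", "status": "running"},
]


def final_statuses(events):
    """Step 1: map each job name to its highest-precedence status."""
    final = {}
    for event in events:
        name, status = event["name"], event["status"]
        if name not in final or PRECEDENCE[status] > PRECEDENCE[final[name]]:
            final[name] = status
    return final


def summarize(final):
    """Step 2: count jobs per final status, keeping one job as an example."""
    summary = {}
    for name, status in final.items():
        entry = summary.setdefault(status, {"count": 0, "example": name})
        entry["count"] += 1
    return summary


summary = summarize(final_statuses(events))
print(summary)
```

The second step depends on the output of the first, which is why this looks like a job for a pipeline aggregation rather than a single bucket aggregation.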

msimos · October 24, 2016, 8:40pm

Hi,

Do you need to use a pipeline aggregation? You could do something like:

{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "name",
        "size": 10
      },
      "aggs": {
        "completed": {
          "filter": {
            "terms": {
              "status": [
                "failed",
                "passed"
              ]
            }
          }
        },
        "running": {
          "filter": {
            "term": {
              "status": "running"
            }
          }
        }
      }
    }
  }
}

Which results in a response like:

  "aggregations": {
    "jobs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "job1",
          "doc_count": 4,
          "running": {
            "doc_count": 2
          },
          "completed": {
            "doc_count": 2
          }
        },
        {
          "key": "job2",
          "doc_count": 2,
          "running": {
            "doc_count": 1
          },
          "completed": {
            "doc_count": 1
          }
        },
        {
          "key": "job3",
          "doc_count": 1,
          "running": {
            "doc_count": 1
          },
          "completed": {
            "doc_count": 0
          }
        }
      ]
    }
  }

Buck_Evan · October 24, 2016, 11:15pm

Yes, this is the first-order aggregation that I mentioned. From this, you can see that there are three jobs, two of which have completed, one of which is running. Imagine that the number of jobs is very large (greater than 50k), and I want a summary like:

{
    "running": 1,
    "passed": 2
}

This summary would need to be built on top of your first-order aggregation in order to avoid any double-counting, and to have correct status for each job. Further, I'd like to return one key from each bucket in this summary:

{
    "counts": {"running": 1, "passed": 2},
    "examples": {"running": "job3", "passed": "job2"}
}
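
For comparison, this is roughly what building that summary on the client would look like, as a sketch in Python over the `jobs` buckets from the response above. It assumes every job bucket has been returned, which is exactly the cost that becomes a problem with more than 50k jobs. Because the `completed` filter in that query merges failed and passed, the sketch can only report "completed", not "passed". The function name `summarize_buckets` is hypothetical.

```python
# Trimmed copy of the "jobs" buckets from the response above.
buckets = [
    {"key": "job1", "running": {"doc_count": 2}, "completed": {"doc_count": 2}},
    {"key": "job2", "running": {"doc_count": 1}, "completed": {"doc_count": 1}},
    {"key": "job3", "running": {"doc_count": 1}, "completed": {"doc_count": 0}},
]


def summarize_buckets(buckets):
    """Classify each job once, then count jobs per status with one example."""
    counts, examples = {}, {}
    for bucket in buckets:
        # A job with any completed event counts as completed, so it is not
        # also counted as running (this avoids the double-counting).
        status = "completed" if bucket["completed"]["doc_count"] > 0 else "running"
        counts[status] = counts.get(status, 0) + 1
        examples.setdefault(status, bucket["key"])  # keep the first job seen
    return {"counts": counts, "examples": examples}


result = summarize_buckets(buckets)
print(result)
```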

Christian_Dahlqvist · October 27, 2016, 7:55pm

Doing this entirely at query time can get expensive at scale, as you may need to go through a lot of data. If you have queries you always want to run based on the last/current status, you can create a separate index that holds one document per job (use the job ID as the document ID). Whenever you get a status update, you write the event into the existing indices as you currently do, but at the same time you update the document belonging to the job in the new index with the current state. This gives you a much smaller index that holds just the last state and can be queried more efficiently even at large scale. This is basically a very simple entity-centric index.
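
A minimal sketch of that idea in Python, using an in-memory dict as a stand-in for the entity-centric index (no Elasticsearch client calls are made here, and `latest_state` and `record_event` are hypothetical names). Each event overwrites the document keyed by the job name, so the "index" only ever holds the most recent status per job, and the summary becomes a simple count over it.

```python
from collections import Counter

# Stand-in for the entity-centric index: job name -> latest document.
latest_state = {}


def record_event(event):
    """Upsert the job's document so it always holds the most recent status."""
    latest_state[event["name"]] = {"name": event["name"], "status": event["status"]}


# Replay the events from the original post in arrival order.
for event in [
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "failed"},
    {"name": "job1", "status": "running"},
    {"name": "job1", "status": "passed"},
    {"name": "job2", "status": "running"},
    {"name": "job2", "status": "passed"},
    {"name": "job3", "status": "running"},
]:
    record_event(event)

# With one document per job, the histogram is a plain count per status.
counts = Counter(doc["status"] for doc in latest_state.values())
print(dict(counts))
```

Note that this is last-write-wins: the stored status is whatever arrived most recently, which matches the desired result for the sample data but is not the same rule as "passed always beats failed". Against a real cluster, the equivalent would presumably be indexing each update with the job name as the document ID and running a single `terms` aggregation on `status` over the small index.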

Buck_Evan · October 27, 2016, 8:11pm

Thanks Christian. That's what we're working on implementing right now, as a workaround for not being able to express the above query.

The drawback that I'd like to eliminate is that we need to keep one index per "first-order" aggregation that we want to support. This means we only support use cases that we've thought of well before the user wants them.

If I can do this without pre-aggregation, it means that I instead support arbitrary use cases.
