Counting buckets with a particular value

Using Elasticsearch 6.2.x, I am failing to write a particular query. My data contains information about when an actor commenced one of several consecutive phases:

curl -XPUT http://localhost:9200/phases

curl -XPUT http://localhost:9200/phases/_mapping/_doc -H'Content-Type:application/json' -d'{
  "properties": {
    "actor": { "type": "keyword" },
    "verb": { "type": "keyword" },
    "object": { "type": "keyword" }
  }
}'

curl -XPOST http://localhost:9200/phases/_doc -H'Content-Type:application/json' -d'{
  "actor": "actor1",
  "verb": "started",
  "object": "phase2",
  "timestamp": "2018-06-20T12:00:00.0000Z"
}'

curl -XPOST http://localhost:9200/phases/_doc -H'Content-Type:application/json' -d'{
  "actor": "actor1",
  "verb": "started",
  "object": "phase1",
  "timestamp": "2018-06-18T12:00:00.0000Z"
}'

curl -XPOST http://localhost:9200/phases/_doc -H'Content-Type:application/json' -d'{
  "actor": "actor2",
  "verb": "started",
  "object": "phase1",
  "timestamp": "2018-06-19T12:00:00.0000Z"
}'

What I'm trying to find out is how many actors are presently in which phase, where only the most recently started phase is considered for each actor (phase1:1, phase2:1 in the example). I know how to query for the phase each actor is presently in by combining a terms with a top_hits query as follows:

curl -XPOST http://localhost:9200/phases/_search -H'Content-Type:application/json' -d'{
  "aggs": {
    "_xs_valueagg": {
      "terms": { "field": "actor" },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "timestamp": { "order": "desc" } }
            ],
            "_source": {
              "includes": [ "object" ]
            }
          }
        }
      }
    }
  }
}'

However, this gives results of the general form: actor1:phase2, actor2:phase1, etc. What I'm failing to achieve is to count the number of actors per phase across these buckets. Since this is a post-processing step on existing buckets I have attempted various forms of pipeline aggregations (stats and bucket_script in particular) but fail to (a) create an entirely new bucket structure and (b) count the occurrences of a categorical attribute.

Is there a way to query for the number of actors per phase, where only the most recently entered phase is considered per actor?

If you have a lot of actors and shards this is computationally hard - especially if you also use time-based indices and default routing of data to shards.
It might be worth considering entity centric indexing - maintaining a doc per actor or actor-session. The benefits are that as well as the actor_current_phase attribute you want you could also have attributes like actor_phase1_duration and actor_phase2_duration or actor_has_invalid_phase_order etc ready for fast analysis in Kibana

Thanks Mark, I had a feeling I needed to re-structure my data to support this query. Interesting to learn that there's a name for this type of structure and tools to support it. Thank you for making the re-indexing scripts available!

Just out of curiosity: is it correct that, given the current organization of the data, it's not even feasible to express this type of query using the query DSL (ignoring computational cost for a moment)? My reasoning being that I've "used up" my one level of bucketing for the TOP 1 part, and that pipeline aggregations can't express the type of bucket restructuring I need. Or am I missing a construct in the query DSL that would allow me to do just that.

For complex cases there is always the scripted aggregation but with an event-centric index and your sort of business question, there's always going to be a scalability limit to the number of entities you can consider in a single aggregation request.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.