Bucket Selector Aggregation Script - access doc index field value


(Hemrajsinh Gharia) #1

Hello,

I am trying to provide a solution for this Stack Overflow question.

`PUT http://localhost:9200/test_index/test/_mapping/`


{
    "test": {
        "properties": {
            "date": {
                "type": "date",
                "format": "strict_date_optional_time||epoch_millis"
            },
            "status": {
                "type": "string",
                "index": "not_analyzed"
            },
            "version": {
                "type": "long"
            },
            "workFlowId": {
                "type": "long"
            }
        }
    }
}

Then I indexed all 10 documents shown in the question, one by one.

`POST http://localhost:9200/test_index/test/1`

{
"date" : "2015-11-01",
"workFlowId" : 1,
"version" : 1,
"status": "In Progress"
}

I tried a Bucket Selector Aggregation with sub-aggregations as follows:

`POST http://localhost:9200/test_index/test/_search?search_type=count`

{

"aggs": {
    "per_day": {
        "date_histogram": {
            "field": "date",
            "interval": "day"
        },
        "aggs": {
            "per_status": {
                "terms": {
                    "field": "status"
                },
                "aggs": {
                    "max_version_per_workflow": {
                        "terms": {
                            "field": "workFlowId"
                        },
                        "aggs": {
                            "max_version": {
                                "max": {
                                    "field": "version"
                                }
                            },
                            "eod_bucket_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "maxVersionPerWorkFlow": "max_version"
                                    },
                                    "script": "2 >= maxVersionPerWorkFlow"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

}

This seems to work and gives the expected results. For each day, I need to find the total number of "In Progress" and "Completed" workflows, considering only the records that have the largest version up to that day. With that in mind, I am using the bucket selector as a filter. As you can see, the script compares against the static value 2. Instead, I need to compare against the document's version value. Using doc['title'].value here does not work. Any suggestions on how I can achieve this?
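To make the requirement concrete, here is a minimal sketch in plain Python of the result I am after. It is not an Elasticsearch query. The sample records are made up for illustration (only the field names come from the mapping above), and the function name `status_counts_per_day` is hypothetical. As far as I understand, a bucket_selector script only sees the bucket-level values named in buckets_path, which would explain why a per-document lookup is not available there.

```python
from collections import Counter

# Hypothetical sample records; only the field names come from the mapping.
records = [
    {"date": "2015-11-01", "workFlowId": 1, "version": 1, "status": "In Progress"},
    {"date": "2015-11-01", "workFlowId": 2, "version": 1, "status": "In Progress"},
    {"date": "2015-11-02", "workFlowId": 1, "version": 2, "status": "Completed"},
    {"date": "2015-11-03", "workFlowId": 2, "version": 2, "status": "Completed"},
]


def status_counts_per_day(records):
    """For each day, count statuses using only each workflow's
    highest-version record dated on or before that day."""
    result = {}
    for day in sorted({r["date"] for r in records}):
        latest = {}  # workFlowId -> highest-version record seen up to `day`
        for r in records:
            if r["date"] <= day:  # ISO dates compare correctly as strings
                best = latest.get(r["workFlowId"])
                if best is None or r["version"] > best["version"]:
                    latest[r["workFlowId"]] = r
        result[day] = dict(Counter(r["status"] for r in latest.values()))
    return result


counts = status_counts_per_day(records)
```

With this sample data, a workflow that moved to "Completed" on a later day still counts as "In Progress" on the earlier days.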


(Christian Dahlqvist) #2

Even though it is most likely possible to do this at query time in Elasticsearch, it may not scale very well, as most, if not all, records need to be considered in the query. If you know some of the queries you want to run, it can sometimes be very beneficial to shift some of the work to index time instead and index the data in multiple ways to efficiently support different types of queries.

In this case you could create a separate workflow-centric index which holds the latest state for each workflow. If you want the latest state for each workflow irrespective of time, this would be a single index, while you could use a time-based daily index if you want to get the correct status per day, as in this case.

Whenever you receive a new status, you index the raw record into the current index. This allows you to track the progress and analyse the task transitions as you currently do. In this index you can typically let Elasticsearch assign the document ID as no updates will take place. In addition to this you also index the record into a workflow-centric index with a unique identifier, e.g. workflow id, as the document id. If several updates come in for the same workflow, each will result in an update and the latest state will be preserved. Running aggregations across this index to find the current or latest state will be considerably more efficient and scalable as you only have a single record per workflow and do not need to filter out documents based on relationships to other documents.
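The dual-write pattern described above can be sketched roughly as follows. This is a minimal in-memory Python simulation, not Elasticsearch client code: the dictionaries stand in for indices, and the index names (`events`, `workflow-state-<date>`) and function names are assumptions made for illustration only.

```python
import itertools
from collections import Counter, defaultdict

# In-memory stand-ins for indices: index name -> {document id -> document}.
indices = defaultdict(dict)
_auto_id = itertools.count(1)  # mimics automatically assigned document IDs


def index_status(record):
    # 1) Raw event index: every record is stored under a fresh ID and is
    #    never updated, so the full history of transitions is kept.
    indices["events"][f"auto-{next(_auto_id)}"] = record
    # 2) Workflow-centric daily index (hypothetical naming): the workflow id
    #    is the document id, so a later write for the same workflow replaces
    #    the earlier one. This assumes records arrive in version order.
    daily = f"workflow-state-{record['date']}"
    indices[daily][str(record["workFlowId"])] = record


def status_counts(index_name):
    # One document per workflow, so a plain count by status is enough.
    return dict(Counter(doc["status"] for doc in indices[index_name].values()))


# Hypothetical sample updates; field names follow the mapping in the question.
for rec in [
    {"date": "2015-11-01", "workFlowId": 1, "version": 1, "status": "In Progress"},
    {"date": "2015-11-01", "workFlowId": 1, "version": 2, "status": "Completed"},
    {"date": "2015-11-01", "workFlowId": 2, "version": 1, "status": "In Progress"},
]:
    index_status(rec)
```

In this sketch the raw index ends up with three documents while the workflow-centric one holds two, one per workflow, so counting by status needs no filtering against other documents.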

Please have a look at this presentation around entity-centric indexing, which explains this further and gives a few other examples.


(Hemrajsinh Gharia) #3

Thanks, Christian. Following your suggestion, I tried to provide a solution. Please check it and correct me if I am wrong.

