Histogram Aggregation Filter Nested Dates Array Per Interval

I have run into a challenge that made me browse these forums and sign up :).

My work is related to cancer trials, and we are currently on ES 5.5. We are trying to track the number of trials that enter different statuses over time. Each trial has an array of status history entries, as shown below:

{
	"status_history": [{
		"comments": [],
		"message_datetime": "2001-04-20T00:00:00",
		"status": "ACTIVE",
		"status_datetime": "2001-04-20T00:00:00"
	}, {
		"comments": [],
		"message_datetime": "2009-08-12T00:00:00",
		"status": "CLOSED_TO_ACCRUAL",
		"status_datetime": "2009-08-12T00:00:00"
	}, {
		"comments": ["Trial completed prematurely."],
		"message_datetime": "2010-04-01T00:00:00",
		"status": "ADMINISTRATIVELY_COMPLETE",
		"status_datetime": "2010-04-01T00:00:00"
	}]
}

When I build the histogram on status_history.status, I expect a single entry for each occurrence in the qualifying date interval. But because the terms aggregation runs over the whole trial document, each interval gives me all 3 statuses instead of only the one relevant to that interval.

Query Example:

{
    	"date_histogram": {
    		"aggs": {
    			"status_history.status": {
    				"terms": {
    					"field": "status_history.status._raw",
    					"order": {
    						"_term": "asc"
    					},
    					"size": 50000
    				}
    			}
    		},
    		"date_histogram": {
    			"extended_bounds": {
    				"max": "2002-02-20T23:59:59.999999Z",
    				"min": "2001-01-01T00:00:00Z"
    			},
    			"field": "status_history.status_datetime",
    			"interval": "year",
    			"min_doc_count": 0
    		}
    	},
    	"status_history.status": {
    		"terms": {
    			"field": "status_history.status._raw",
    			"order": {
    				"_term": "asc"
    			},
    			"size": 50000
    		}
    	}
    }

Returns:

[{
    	"doc_count": 1,
    	"key": 978307200000,
    	"key_as_string": "2001-01-01T00:00:00.000Z",
    	"status_history.status": {
    		"buckets": [{
    			"doc_count": 1,
    			"key": "ACTIVE"
    		}, {
    			"doc_count": 1,
    			"key": "ADMINISTRATIVELY_COMPLETE"
    		}, {
    			"doc_count": 1,
    			"key": "CLOSED_TO_ACCRUAL"
    		}],
    		"doc_count_error_upper_bound": 0,
    		"sum_other_doc_count": 0
    	}
    }, {
    	"doc_count": 0,
    	"key": 1009843200000,
    	"key_as_string": "2002-01-01T00:00:00.000Z",
    	"status_history.status": {
    		"buckets": [],
    		"doc_count_error_upper_bound": 0,
    		"sum_other_doc_count": 0
    	}
    }]

It should return something like:

[{
        	"doc_count": 1,
        	"key": 978307200000,
        	"key_as_string": "2001-01-01T00:00:00.000Z",
        	"status_history.status": {
        		"buckets": [{
        			"doc_count": 1,
        			"key": "ACTIVE"
        		}],
        		"doc_count_error_upper_bound": 0,
        		"sum_other_doc_count": 0
        	}
        }, {
        	"doc_count": 0,
        	"key": 1009843200000,
        	"key_as_string": "2002-01-01T00:00:00.000Z",
        	"status_history.status": {
        		"buckets": [],
        		"doc_count_error_upper_bound": 0,
        		"sum_other_doc_count": 0
        	}
        }]

What else do I need in order to filter out the non-qualifying dates per date interval?

I hope someone can help.

Thanks,
Peter

Hello @kigonya.

Since you have an array of objects within each document, review:

I am going to keep reading and let you know if I figure it out.

At the get-go, though, I'm not sure how those links will help in the scenario/example above. Note that if I am filtering on a static value across all entries in the bucket, that part is easy; we already use that type of filter aggregation. But we are using status_history.status_datetime (status_history being an array) for the histogram, and we only want the items in the array that correspond to the key of each histogram entry, per entry in the histogram, to get the change of status over time. It sounds like there should be a solution. If you format the JSON above, our objective will be clear. I had a hard time trying to format it and will try again today. I thought it would be easy and planned to put the code in production yesterday, but I ran into this seemingly solvable roadblock. I appreciate your time.

I successfully formatted the JSON :blush::grinning:

I see someone with the same unanswered issue at Date range filtering does not affect date histogram aggregation. Wondering if it's an ES weakness.

Hello @mikebd
The docs are leading to something like:

{
	"date_histogram": {
		"aggs": {
			"transactions": {
				"aggs": {
					"status_history.status_nested": {
						"aggs": {
							"status_history.status": {
								"terms": {
									"field": "status_history.status._raw",
									"order": {
										"_term": "asc"
									},
									"size": 50000
								}
							}
						},
						"filter": {
							"range": {
								"status_history.status_datetime": {
									"format": "yyyy-MM-dd",
									"gte": "2001-01-01",   <==========should change per interval given ES set up
									"lte": "2018-12-01"   <==========should change per interval given ES set up
								}
							}
						}
					}
				},
				"nested": {
					"path": "status_history"
				}
			}
		},
		"date_histogram": {
			"extended_bounds": {
				"max": "2010-04-10T23:59:59.999999Z",
				"min": "2001-05-01T00:00:00Z"
			},
			"field": "status_history.status_datetime",
			"interval": "year",
			"min_doc_count": 0
		}
	},
	"status_history.status": {
		"terms": {
			"field": "status_history.status._raw",
			"order": {
				"_term": "asc"
			},
			"size": 50000
		}
	}
} 

But as you will see, this is insufficient, because the values for gte and lte would need to change per interval in the bucket results. This seems like it should be so easy given the trial object above.
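One variation worth trying (I have not verified it against our mapping): since the date_histogram can itself run inside the nested context, moving it into the nested aggregation, instead of alongside it, should make each status entry fall into its own interval bucket, so the terms aggregation beneath it only sees the entries for that interval and the gte/lte values no longer need to change per interval. A sketch, assuming status_history is mapped as nested (the aggregation names statuses and per_year are placeholders):

{
	"aggs": {
		"statuses": {
			"nested": {
				"path": "status_history"
			},
			"aggs": {
				"per_year": {
					"date_histogram": {
						"field": "status_history.status_datetime",
						"interval": "year",
						"min_doc_count": 0
					},
					"aggs": {
						"status_history.status": {
							"terms": {
								"field": "status_history.status._raw",
								"order": {
									"_term": "asc"
								},
								"size": 50000
							}
						}
					}
				}
			}
		}
	}
}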

Peter --

I placed this doc into a 6.x cluster, and I am seeing different results.

I used the following mapping:

{
  "test-nested": {
    "mappings": {
      "_doc": {
        "properties": {
          "status_history": {
            "type": "nested",
            "include_in_root": true,
            "properties": {
              "comments": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "message_datetime": {
                "type": "date"
              },
              "status": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "status_datetime": {
                "type": "date"
              }
            }
          }
        }
      }
    }
  }
}
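(For reference, I indexed the sample document from the first post like this; the index name matches the mapping above, and the document id is arbitrary.)

PUT test-nested/_doc/1
{
	"status_history": [{
		"comments": [],
		"message_datetime": "2001-04-20T00:00:00",
		"status": "ACTIVE",
		"status_datetime": "2001-04-20T00:00:00"
	}, {
		"comments": [],
		"message_datetime": "2009-08-12T00:00:00",
		"status": "CLOSED_TO_ACCRUAL",
		"status_datetime": "2009-08-12T00:00:00"
	}, {
		"comments": ["Trial completed prematurely."],
		"message_datetime": "2010-04-01T00:00:00",
		"status": "ADMINISTRATIVELY_COMPLETE",
		"status_datetime": "2010-04-01T00:00:00"
	}]
}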

And here are the results from this query (using _key for ordering, since _term was deprecated in 6.x):

GET test-nested/_search
{
 "size": 0, 
 "aggs" : {
	"date_histogram": {
		"aggs": {
			"status_history.status": {
				"terms": {
					"field": "status_history.status._raw",
					"order": {
						"_key": "asc"
					},
					"size": 50000
				}
			}
		},
		"date_histogram": {
			"extended_bounds": {
				"max": "2002-02-20T23:59:59.999999Z",
				"min": "2001-01-01T00:00:00Z"
			},
			"field": "status_history.status_datetime",
			"interval": "year",
			"min_doc_count": 0
		}
	}
 }
}

Result set:

"date_histogram": {
  "buckets": [
    {
      "key_as_string": "2001-01-01T00:00:00.000Z",
      "key": 978307200000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2002-01-01T00:00:00.000Z",
      "key": 1009843200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2003-01-01T00:00:00.000Z",
      "key": 1041379200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2004-01-01T00:00:00.000Z",
      "key": 1072915200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2005-01-01T00:00:00.000Z",
      "key": 1104537600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2006-01-01T00:00:00.000Z",
      "key": 1136073600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2007-01-01T00:00:00.000Z",
      "key": 1167609600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2008-01-01T00:00:00.000Z",
      "key": 1199145600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2009-01-01T00:00:00.000Z",
      "key": 1230768000000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2010-01-01T00:00:00.000Z",
      "key": 1262304000000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    }
  ]
}

Can you update to 6? I do not think this aggregation functionality should have changed between versions, but I would like to see if you get the same results.

I almost got excited. :wink:
Yes... I also got this far in 5.5. But notice that although doc_count is 1, the buckets array is empty...

Unless I'm misunderstanding something, it should be:

{
  "key_as_string": "2001-01-01T00:00:00.000Z",
  "key": 978307200000,
  "doc_count": 1,
  "status_history.status": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [{
		"doc_count": 1,
		"key": "ACTIVE"    <<<==== we need this
	}]
  }
}

This works well in our other filters, where we are not using an array.
Elasticsearch seems to do a two-step process whose steps are unfortunately independent:
STEP 1: Get the histogram buckets.
STEP 2: Aggregate within the histogram buckets (regardless of the histogram's interval filtering).
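If the two steps really cannot be tied together, one manual workaround I may try is a filters aggregation with one explicit range per interval, run inside the nested context, so each named bucket only contains the status entries of its own interval. A sketch (the bucket names "2001"/"2002" and the aggregation names are illustrative; one filter would be needed per year of interest):

{
	"aggs": {
		"statuses": {
			"nested": {
				"path": "status_history"
			},
			"aggs": {
				"per_interval": {
					"filters": {
						"filters": {
							"2001": {
								"range": {
									"status_history.status_datetime": {
										"gte": "2001-01-01",
										"lt": "2002-01-01",
										"format": "yyyy-MM-dd"
									}
								}
							},
							"2002": {
								"range": {
									"status_history.status_datetime": {
										"gte": "2002-01-01",
										"lt": "2003-01-01",
										"format": "yyyy-MM-dd"
									}
								}
							}
						}
					},
					"aggs": {
						"status_history.status": {
							"terms": {
								"field": "status_history.status._raw",
								"order": {
									"_term": "asc"
								},
								"size": 50000
							}
						}
					}
				}
			}
		}
	}
}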

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.