Histogram Aggregation Filter Nested Dates Array Per Interval

I have run into a challenge that made me browse these forums and sign up :).

My work is related to cancer trials, and we are currently on ES 5.5. We are trying to track the number of trials that enter different statuses over time. Each trial has an array of status history entries, as shown below:

{
	"status_history": [{
		"comments": [],
		"message_datetime": "2001-04-20T00:00:00",
		"status": "ACTIVE",
		"status_datetime": "2001-04-20T00:00:00"
	}, {
		"comments": [],
		"message_datetime": "2009-08-12T00:00:00",
		"status": "CLOSED_TO_ACCRUAL",
		"status_datetime": "2009-08-12T00:00:00"
	}, {
		"comments": ["Trial completed prematurely."],
		"message_datetime": "2010-04-01T00:00:00",
		"status": "ADMINISTRATIVELY_COMPLETE",
		"status_datetime": "2010-04-01T00:00:00"
	}]
}

When I build the histogram on status_history.status, I expect a single entry for each occurrence in the qualifying date interval. But because the terms aggregation runs over the whole trial document, each interval gives me all 3 statuses instead of only the one relevant to that interval.

Query Example:

{
    	"date_histogram": {
    		"aggs": {
    			"status_history.status": {
    				"terms": {
    					"field": "status_history.status._raw",
    					"order": {
    						"_term": "asc"
    					},
    					"size": 50000
    				}
    			}
    		},
    		"date_histogram": {
    			"extended_bounds": {
    				"max": "2002-02-20T23:59:59.999999Z",
    				"min": "2001-01-01T00:00:00Z"
    			},
    			"field": "status_history.status_datetime",
    			"interval": "year",
    			"min_doc_count": 0
    		}
    	},
    	"status_history.status": {
    		"terms": {
    			"field": "status_history.status._raw",
    			"order": {
    				"_term": "asc"
    			},
    			"size": 50000
    		}
    	}
    }

Returns:

[{
    	"doc_count": 1,
    	"key": 978307200000,
    	"key_as_string": "2001-01-01T00:00:00.000Z",
    	"status_history.status": {
    		"buckets": [{
    			"doc_count": 1,
    			"key": "ACTIVE"
    		}, {
    			"doc_count": 1,
    			"key": "ADMINISTRATIVELY_COMPLETE"
    		}, {
    			"doc_count": 1,
    			"key": "CLOSED_TO_ACCRUAL"
    		}],
    		"doc_count_error_upper_bound": 0,
    		"sum_other_doc_count": 0
    	}
    }, {
    	"doc_count": 0,
    	"key": 1009843200000,
    	"key_as_string": "2002-01-01T00:00:00.000Z",
    	"status_history.status": {
    		"buckets": [],
    		"doc_count_error_upper_bound": 0,
    		"sum_other_doc_count": 0
    	}
    }]

It should return something like:

[{
        	"doc_count": 1,
        	"key": 978307200000,
        	"key_as_string": "2001-01-01T00:00:00.000Z",
        	"status_history.status": {
        		"buckets": [{
        			"doc_count": 1,
        			"key": "ACTIVE"
        		}],
        		"doc_count_error_upper_bound": 0,
        		"sum_other_doc_count": 0
        	}
        }, {
        	"doc_count": 0,
        	"key": 1009843200000,
        	"key_as_string": "2002-01-01T00:00:00.000Z",
        	"status_history.status": {
        		"buckets": [],
        		"doc_count_error_upper_bound": 0,
        		"sum_other_doc_count": 0
        	}
        }]

What else do I need in order to filter out the non-qualifying dates per date interval?

I hope someone can help.

Thanks,
Peter

Hello @kigonya.

Since you have an array of objects within each document, review:

I am going to keep reading and let you know if I figure it out.

At the get-go, though, I'm not sure how those links will help in the scenario/example above. Note that if I am filtering on a static value across all entries in the bucket, that part is easy; we already use that type of filter aggregation. But we are using status_history.status_datetime (status_history being an array) for the histogram, and we only want the items in the array that correspond to the key of each histogram entry, per entry in the histogram, to get the change of status over time. It sounds like there should be a solution. If you format the JSON above, our objective will be clear. I had a hard time trying to format it and will try again today. I thought it would be easy and planned to put the code in production yesterday, but I ran into this seemingly solvable roadblock. I appreciate your time.

I successfully formatted the JSON :blush::grinning:

I see someone with the same unanswered issue at Date range filtering does not affect date histogram aggregation. Wondering if it's an ES weakness.

Hello @mikebd
The docs are leading to something like:

{
	"date_histogram": {
		"aggs": {
			"transactions": {
				"aggs": {
					"status_history.status_nested": {
						"aggs": {
							"status_history.status": {
								"terms": {
									"field": "status_history.status._raw",
									"order": {
										"_term": "asc"
									},
									"size": 50000
								}
							}
						},
						"filter": {
							"range": {
								"status_history.status_datetime": {
									"format": "yyyy-MM-dd",
									"gte": "2001-01-01",   <==========should change per interval given ES set up
									"lte": "2018-12-01"   <==========should change per interval given ES set up
								}
							}
						}
					}
				},
				"nested": {
					"path": "status_history"
				}
			}
		},
		"date_histogram": {
			"extended_bounds": {
				"max": "2010-04-10T23:59:59.999999Z",
				"min": "2001-05-01T00:00:00Z"
			},
			"field": "status_history.status_datetime",
			"interval": "year",
			"min_doc_count": 0
		}
	},
	"status_history.status": {
		"terms": {
			"field": "status_history.status._raw",
			"order": {
				"_term": "asc"
			},
			"size": 50000
		}
	}
} 

But as you will see, this is insufficient, because the values for gte and lte would need to change per interval in the bucket results. This seems like it should be so easy given the trial object above.
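One variation worth trying (I have not verified it against our mapping): since the date_histogram can itself run inside the nested context, moving it into the nested aggregation, instead of alongside it, should make each status entry fall into its own interval bucket, so the terms aggregation beneath it only sees the entries for that interval and the gte/lte values no longer need to change per interval. A sketch, assuming status_history is mapped as nested (the aggregation names statuses and per_year are placeholders):

{
	"aggs": {
		"statuses": {
			"nested": {
				"path": "status_history"
			},
			"aggs": {
				"per_year": {
					"date_histogram": {
						"field": "status_history.status_datetime",
						"interval": "year",
						"min_doc_count": 0
					},
					"aggs": {
						"status_history.status": {
							"terms": {
								"field": "status_history.status._raw",
								"order": {
									"_term": "asc"
								},
								"size": 50000
							}
						}
					}
				}
			}
		}
	}
}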

Peter --

I placed this doc into a 6.x cluster, and I am seeing different results.

I used the following mapping:

{
  "test-nested": {
    "mappings": {
      "_doc": {
        "properties": {
          "status_history": {
            "type": "nested",
            "include_in_root": true,
            "properties": {
              "comments": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "message_datetime": {
                "type": "date"
              },
              "status": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "status_datetime": {
                "type": "date"
              }
            }
          }
        }
      }
    }
  }
}
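(For reference, I indexed the sample document from the first post like this; the index name matches the mapping above, and the document id is arbitrary.)

PUT test-nested/_doc/1
{
	"status_history": [{
		"comments": [],
		"message_datetime": "2001-04-20T00:00:00",
		"status": "ACTIVE",
		"status_datetime": "2001-04-20T00:00:00"
	}, {
		"comments": [],
		"message_datetime": "2009-08-12T00:00:00",
		"status": "CLOSED_TO_ACCRUAL",
		"status_datetime": "2009-08-12T00:00:00"
	}, {
		"comments": ["Trial completed prematurely."],
		"message_datetime": "2010-04-01T00:00:00",
		"status": "ADMINISTRATIVELY_COMPLETE",
		"status_datetime": "2010-04-01T00:00:00"
	}]
}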

And here are the results from this query (using _key for ordering, since _term was deprecated in 6.x):

GET test-nested/_search
{
 "size": 0, 
 "aggs" : {
	"date_histogram": {
		"aggs": {
			"status_history.status": {
				"terms": {
					"field": "status_history.status._raw",
					"order": {
						"_key": "asc"
					},
					"size": 50000
				}
			}
		},
		"date_histogram": {
			"extended_bounds": {
				"max": "2002-02-20T23:59:59.999999Z",
				"min": "2001-01-01T00:00:00Z"
			},
			"field": "status_history.status_datetime",
			"interval": "year",
			"min_doc_count": 0
		}
	}
 }
}

Result set:

"date_histogram": {
  "buckets": [
    {
      "key_as_string": "2001-01-01T00:00:00.000Z",
      "key": 978307200000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2002-01-01T00:00:00.000Z",
      "key": 1009843200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2003-01-01T00:00:00.000Z",
      "key": 1041379200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2004-01-01T00:00:00.000Z",
      "key": 1072915200000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2005-01-01T00:00:00.000Z",
      "key": 1104537600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2006-01-01T00:00:00.000Z",
      "key": 1136073600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2007-01-01T00:00:00.000Z",
      "key": 1167609600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2008-01-01T00:00:00.000Z",
      "key": 1199145600000,
      "doc_count": 0,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2009-01-01T00:00:00.000Z",
      "key": 1230768000000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    },
    {
      "key_as_string": "2010-01-01T00:00:00.000Z",
      "key": 1262304000000,
      "doc_count": 1,
      "status_history.status": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": []
      }
    }
  ]
}

Can you update to 6? I do not think this aggregation functionality should have changed between versions, but I would like to see if you get the same results.

I almost got excited. :wink:
Yes... I also got this far in 5.5. But notice that although doc_count is 1, the buckets array is empty...

Unless I'm misunderstanding something, it should be:

{
  "key_as_string": "2001-01-01T00:00:00.000Z",
  "key": 978307200000,
  "doc_count": 1,
  "status_history.status": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [{
		"doc_count": 1,
		"key": "ACTIVE"    <<<==== we need this
	}]
  }
}

This works well in our other filters, where we are not using an array.
Elasticsearch seems to do a two-step process whose steps are unfortunately independent:
STEP 1: Get the histogram buckets.
STEP 2: Aggregate within the histogram buckets (regardless of the histogram's interval filtering).
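If the two steps really cannot be tied together, one manual workaround I may try is a filters aggregation with one explicit range per interval, run inside the nested context, so each named bucket only contains the status entries of its own interval. A sketch (the bucket names "2001"/"2002" and the aggregation names are illustrative; one filter would be needed per year of interest):

{
	"aggs": {
		"statuses": {
			"nested": {
				"path": "status_history"
			},
			"aggs": {
				"per_interval": {
					"filters": {
						"filters": {
							"2001": {
								"range": {
									"status_history.status_datetime": {
										"gte": "2001-01-01",
										"lt": "2002-01-01",
										"format": "yyyy-MM-dd"
									}
								}
							},
							"2002": {
								"range": {
									"status_history.status_datetime": {
										"gte": "2002-01-01",
										"lt": "2003-01-01",
										"format": "yyyy-MM-dd"
									}
								}
							}
						}
					},
					"aggs": {
						"status_history.status": {
							"terms": {
								"field": "status_history.status._raw",
								"order": {
									"_term": "asc"
								},
								"size": 50000
							}
						}
					}
				}
			}
		}
	}
}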

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.