BACKGROUND
I'm trying to determine the average number of images created per event. An event is defined by patient and body part, but also by createTime: its duration is derived from a chain of createTimes that are "close" to one another. So, for images something like these (skipping id and other properties):
{"patientId": "1000", "bodyPart": "HEAD", "createTime": "2006-06-16T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime": "2006-06-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "FOOT", "createTime": "2006-06-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime": "2006-07-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime": "2006-08-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime": "2006-11-17T00:00:00.000Z" }
{"patientId": "2000", "bodyPart": "HEAD", "createTime": "2006-06-16T00:00:00.000Z" }
{"patientId": "2000", "bodyPart": "HEAD", "createTime": "2006-08-16T00:00:00.000Z" }
And for a distance of 6 weeks, we'd get:
- 1000, HEAD, 2006-06-16: count 4
- 1000, FOOT, 2006-06-17: count 1
- 1000, HEAD, 2006-11-17: count 1
- 2000, HEAD, 2006-06-16: count 1
- 2000, HEAD, 2006-08-16: count 1
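This is because patient 1000 has 4 HEAD images that form a chain (6/16, 6/17, 7/17, 8/17), each within 6 weeks of the previous one. The last HEAD image (11/17) is not close enough, and is therefore considered a separate event. Likewise, patient 2000's two HEAD images are more than 6 weeks apart, so each is its own event. I'm using the first createTime in a chain (along with patientId and bodyPart) to identify the event.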
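To make the chaining rule concrete, here is a minimal client-side sketch in Python. It is not an Elasticsearch solution, just the logic I'd like to reproduce. The names `parse_time`, `chain_events` and `max_gap` are made up for illustration, and I'm assuming a gap of exactly 6 weeks still counts as "close".

```python
from collections import defaultdict
from datetime import datetime, timedelta

SAMPLE_IMAGES = [
    {"patientId": p, "bodyPart": b, "createTime": t + "T00:00:00.000Z"}
    for p, b, t in [
        ("1000", "HEAD", "2006-06-16"), ("1000", "HEAD", "2006-06-17"),
        ("1000", "FOOT", "2006-06-17"), ("1000", "HEAD", "2006-07-17"),
        ("1000", "HEAD", "2006-08-17"), ("1000", "HEAD", "2006-11-17"),
        ("2000", "HEAD", "2006-06-16"), ("2000", "HEAD", "2006-08-16"),
    ]
]

def parse_time(value):
    # Timestamps in the sample look like 2006-06-16T00:00:00.000Z
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")

def chain_events(images, max_gap=timedelta(weeks=6)):
    """Return {(patientId, bodyPart, first createTime date): image count}."""
    by_key = defaultdict(list)
    for image in images:
        key = (image["patientId"], image["bodyPart"])
        by_key[key].append(parse_time(image["createTime"]))

    events = {}
    for (patient_id, body_part), times in by_key.items():
        times.sort()
        start = previous = times[0]
        count = 0
        for current in times:
            # Assumption: a gap larger than max_gap starts a new event.
            if current - previous > max_gap:
                events[(patient_id, body_part, start.date().isoformat())] = count
                start, count = current, 0
            count += 1
            previous = current
        # Close the last open chain for this patient and body part.
        events[(patient_id, body_part, start.date().isoformat())] = count
    return events

events = chain_events(SAMPLE_IMAGES)
```

Each image is compared with the previous image in the same patient/body part group, not with the start of the chain, which is why 8/17 still belongs to the event that started on 6/16.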
From this, we would want to further aggregate by body part:
HEAD: (4 + 1 + 1 + 1) / 4 = 1.75
FOOT: 1 / 1 = 1
...
and maybe also do a histogram over time.
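The per-body-part average could be sketched like this in Python, taking the event counts from the example as given. The function name `average_images_per_event` and the shape of the `events` dictionary are assumptions for illustration only.

```python
from collections import defaultdict

def average_images_per_event(events):
    """events maps (patientId, bodyPart, event start) to an image count."""
    totals = defaultdict(lambda: [0, 0])  # bodyPart -> [image total, event count]
    for (_patient_id, body_part, _start), count in events.items():
        totals[body_part][0] += count
        totals[body_part][1] += 1
    return {part: images / n for part, (images, n) in totals.items()}

# Event counts from the 6-week example.
events = {
    ("1000", "HEAD", "2006-06-16"): 4,
    ("1000", "FOOT", "2006-06-17"): 1,
    ("1000", "HEAD", "2006-11-17"): 1,
    ("2000", "HEAD", "2006-06-16"): 1,
    ("2000", "HEAD", "2006-08-16"): 1,
}
averages = average_images_per_event(events)
```

This is the second-stage aggregation I'd like Elasticsearch to run over the event buckets.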
PROBLEM
The problem is, I'm not seeing any way to define events in Elasticsearch. Scripted Metric Aggregation looked promising, but as far as I can make out, you can't feed its result into a pipeline for further processing. Plus, I'm guessing it would be very inefficient. I found this issue about k-means clustering, which is in the ballpark, but it's been out there since 2014. Is there some approach that can be used in Elasticsearch to create buckets based on clusters of data (however you define the cluster)?