Aggregating on related fields across documents

BACKGROUND
I'm trying to determine the average number of images created per event. An event is defined by patient and body part, but also by createTime: an event's duration is derived from createTimes that are "close" to one another. So, for images something like (skipping id and other properties):

{"patientId": "1000", "bodyPart": "HEAD", "createTime":  "2006-06-16T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime":  "2006-06-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "FOOT", "createTime":  "2006-06-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime":  "2006-07-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime":  "2006-08-17T00:00:00.000Z" }
{"patientId": "1000", "bodyPart": "HEAD", "createTime":  "2006-11-17T00:00:00.000Z" }

{"patientId": "2000", "bodyPart": "HEAD", "createTime":  "2006-06-16T00:00:00.000Z" }
{"patientId": "2000", "bodyPart": "HEAD", "createTime":  "2006-08-16T00:00:00.000Z" }

And for a distance of 6 weeks, we'd get:

  • 1000, HEAD, 2006-06-16: count: 4

  • 1000, FOOT, 2006-06-17: count: 1

  • 1000, HEAD, 2006-11-17: count: 1

  • 2000, HEAD, 2006-06-16: count: 1

  • 2000, HEAD, 2006-08-16: count: 1

This is because patient 1000 has 4 HEAD images that form a chain (6/16, 6/17, 7/17, 8/17), while the last HEAD image (11/17) is not close enough to the chain and is therefore considered a separate event. Likewise, patient 2000's two HEAD images are more than 6 weeks apart, so each is its own event. I'm using the first createTime in a chain (along with patientId and bodyPart) to identify the event.
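To make the "close enough" rule concrete, here is a minimal sketch in Python that computes the gaps between consecutive HEAD images for patient 1000 and flags the ones that would start a new event. It assumes "close" means a gap of at most the threshold (the post does not say whether the boundary is inclusive), and the names `parse`, `gap_days` and `breaks` are only illustrative.

```python
from datetime import datetime, timedelta

# createTime values of patient 1000's HEAD images, copied from the sample above.
head_times = [
    "2006-06-16T00:00:00.000Z",
    "2006-06-17T00:00:00.000Z",
    "2006-07-17T00:00:00.000Z",
    "2006-08-17T00:00:00.000Z",
    "2006-11-17T00:00:00.000Z",
]

def parse(ts):
    """Parse the timestamp format used in the sample documents."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

threshold = timedelta(weeks=6)  # the "distance of 6 weeks" from the example

times = sorted(parse(t) for t in head_times)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
gap_days = [g.days for g in gaps]

# Assumption: a gap strictly larger than the threshold starts a new event.
breaks = [g > threshold for g in gaps]

print(gap_days)  # [1, 30, 31, 92]
print(breaks)    # [False, False, False, True]
```

Only the last gap (92 days) exceeds 6 weeks, which is why the first four images form one event and the 11/17 image stands alone.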

From this, we would want to further aggregate by body part:
HEAD (4 + 1 + 1 + 1) / 4 = 1.75
FOOT (1) / 1 = 1
...
and maybe also do a histogram over time.
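As a rough illustration of that second step, the sketch below averages the per-event counts by body part. The `events` list is simply the five events listed above typed in by hand, and `average_images_per_event` is a hypothetical helper name.

```python
from collections import defaultdict

# (patientId, bodyPart, first createTime in the chain, image count)
events = [
    ("1000", "HEAD", "2006-06-16", 4),
    ("1000", "FOOT", "2006-06-17", 1),
    ("1000", "HEAD", "2006-11-17", 1),
    ("2000", "HEAD", "2006-06-16", 1),
    ("2000", "HEAD", "2006-08-16", 1),
]

def average_images_per_event(events):
    """Return {bodyPart: total images / number of events}."""
    totals = defaultdict(lambda: [0, 0])  # bodyPart -> [image total, event count]
    for _patient_id, body_part, _first_time, count in events:
        totals[body_part][0] += count
        totals[body_part][1] += 1
    return {bp: images / n for bp, (images, n) in totals.items()}

averages = average_images_per_event(events)
print(averages)  # {'HEAD': 1.75, 'FOOT': 1.0}
```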

PROBLEM
The problem is, I'm not seeing any way to define events in elasticsearch. Scripted Metric Aggregation looked promising, but as far as I can make out, you can't use it as a pipeline for further processing. Plus I'm guessing it would be very inefficient. I found this issue about k-means clustering, which is in the ballpark, but it's been out there since 2014. Is there some approach that can be used in elasticsearch to create buckets based on clusters of data (however you define the cluster)?

I think this would require a form of entity-centric index built from your original docs.
In this case the entity might be called “complaint”, and there is no single common grouping key as in the examples in my link. You’d need to sort the original docs by patientId, bodyPart and time. This sorted stream could be provided by paging through the ‘scroll’ API or a ‘composite’ aggregation.
Your client code would group multiple patientId + bodyPart docs together into a single “complaint” doc as long as the gap between an image and the previous image is below a threshold (e.g. a week). The complaint docs are then indexed with the appropriate attributes (patientId, bodyPart, numImages, firstDate, lastDate, duration...)
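The client-side grouping described above might look something like the following Python sketch. It is not tied to any Elasticsearch client: it assumes the docs have already been fetched and sorted by patientId, bodyPart and createTime, it treats a gap of at most the threshold as "same complaint" (an assumption), and `build_complaints` is a hypothetical helper. Indexing the resulting complaint docs is left out.

```python
from datetime import datetime, timedelta
from itertools import groupby

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def build_complaints(sorted_docs, threshold):
    """Yield one complaint dict per chain of images.

    sorted_docs must already be ordered by patientId, bodyPart, createTime.
    """
    by_key = lambda d: (d["patientId"], d["bodyPart"])
    for (patient_id, body_part), docs in groupby(sorted_docs, key=by_key):
        current = None
        for doc in docs:
            t = parse(doc["createTime"])
            if current is not None and t - current["lastDate"] <= threshold:
                # Close enough to the previous image: extend the open complaint.
                current["numImages"] += 1
                current["lastDate"] = t
            else:
                # First image of the group, or the gap is too large.
                if current is not None:
                    yield current
                current = {"patientId": patient_id, "bodyPart": body_part,
                           "numImages": 1, "firstDate": t, "lastDate": t}
        if current is not None:
            yield current

# The sample images from the original post.
rows = [
    ("1000", "HEAD", "2006-06-16"), ("1000", "HEAD", "2006-06-17"),
    ("1000", "FOOT", "2006-06-17"), ("1000", "HEAD", "2006-07-17"),
    ("1000", "HEAD", "2006-08-17"), ("1000", "HEAD", "2006-11-17"),
    ("2000", "HEAD", "2006-06-16"), ("2000", "HEAD", "2006-08-16"),
]
images = [{"patientId": p, "bodyPart": b, "createTime": d + "T00:00:00.000Z"}
          for p, b, d in rows]

# Stand-in for the sorted stream that scroll or composite paging would provide.
sorted_docs = sorted(images, key=lambda d: (d["patientId"], d["bodyPart"], d["createTime"]))

complaints = list(build_complaints(sorted_docs, timedelta(weeks=6)))
for c in complaints:
    print(c["patientId"], c["bodyPart"], c["firstDate"].date(), c["numImages"])
```

The duration attribute can be derived from each complaint as `lastDate - firstDate` before indexing.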

Actually, I had just watched the video you linked to in your tweet the day I posted it, and was considering making the event (or complaint, as you called it) an entity. A complication is that the threshold (1 day, 1 week, 1 month) is ideally configurable per query. (A fact that was in my head, but somehow didn't quite make it into my post.) Moreover, there are likely to be other issues, like conditionally determining which documents are part of an event. Ideally, these would be determined at query time. It sounds like this can't be done at query time (and again, it would probably be very inefficient, and a bad idea, even if it could). Sounds like the best approach may be to provide entity indexes for some pre-defined cases (e.g. 1 week, 1 month, 6 months, 1 year), and provide a facility for building new indexes for special cases, as needed.

I opted for "complaint" as the entity name because we tend to use "event" to describe the source logs from which entities are typically built. Calling an entity "event" would be confusing (to me at least :) ).

Yeah - each of these could be built from the same single pass over the sorted event stream - just outputting complaint docs based on different time threshold breaks.
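A minimal Python sketch of that single pass, under the same assumptions as before (docs pre-sorted by patientId, bodyPart and createTime; a gap of at most the threshold keeps an image in the open complaint). The function name `complaints_for_thresholds` and the labels "1w" and "6w" are made up for illustration. Because the break decision only depends on the gap to the previous image, one loop can feed every threshold at once.

```python
from datetime import datetime, timedelta

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def complaints_for_thresholds(sorted_docs, thresholds):
    """One pass over the sorted stream, one complaint list per threshold.

    Returns {label: [[patientId, bodyPart, firstCreateTime, numImages], ...]}.
    """
    out = {label: [] for label in thresholds}
    prev_key, prev_time = None, None
    for doc in sorted_docs:
        key = (doc["patientId"], doc["bodyPart"])
        t = parse(doc["createTime"])
        for label, limit in thresholds.items():
            if key == prev_key and t - prev_time <= limit:
                out[label][-1][3] += 1  # extend the open complaint
            else:
                out[label].append([key[0], key[1], doc["createTime"], 1])
        prev_key, prev_time = key, t
    return out

# The sample images from the original post, already in sorted order.
rows = [
    ("1000", "FOOT", "2006-06-17"),
    ("1000", "HEAD", "2006-06-16"), ("1000", "HEAD", "2006-06-17"),
    ("1000", "HEAD", "2006-07-17"), ("1000", "HEAD", "2006-08-17"),
    ("1000", "HEAD", "2006-11-17"),
    ("2000", "HEAD", "2006-06-16"), ("2000", "HEAD", "2006-08-16"),
]
sorted_docs = [{"patientId": p, "bodyPart": b, "createTime": d + "T00:00:00.000Z"}
               for p, b, d in rows]

result = complaints_for_thresholds(
    sorted_docs, {"1w": timedelta(weeks=1), "6w": timedelta(weeks=6)})

print([c[3] for c in result["1w"]])  # [1, 2, 1, 1, 1, 1, 1]
print([c[3] for c in result["6w"]])  # [1, 4, 1, 1, 1]
```

Each list in `result` would then go to its own entity index (or carry the threshold as a field), which is a design choice the thread leaves open.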
