Conditionally aggregating documents in an index?

I'm trying to conditionally aggregate documents in my index based on the existence of another document in the same set. Imagine the following series:

{"action": "pressed_submit", "user": 1, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "answered_question", "user": 1, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "answered_question", "user": 1, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "filled_textbox", "user": 1, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "pressed_submit", "user": 1, "timestamp": "2012-01-01 00:00:00", "meta": ...}

And then the following:

{"action": "answered_question", "user": 2, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "answered_question", "user": 2, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "filled_textbox", "user": 2, "timestamp": "2016-01-01 00:00:00", "meta": ...}
{"action": "pressed_submit", "user": 1, "timestamp": "2012-01-01 00:00:00", "meta": ...}

How can I find the last event for each user who has not "pressed_submit"? In other words, for every user with no pressed_submit event, I want that user's most recent action. Note the timestamps.

I've been racking my brain over this for a while. I've managed to solve the problem by querying ALL events and then filtering in Python code, but that is very slow. Is there any way to use Elasticsearch's query engine to get results like that?

Behavioural analysis at scale typically requires an entity-centric index; see https://twitter.com/elasticmark/status/1009380268409610240?s=21

I'm sorry, but I didn't understand your answer. I've watched the linked video, but I still don't understand what the entity-centric index would be in this scenario.

Users and their last events?

That's already there. The JSON sample I sent is a single document type, in a single index.

Those documents are events.
An entity centric document would have the id field (think primary key) of the user id.
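To make the difference concrete, here is a minimal sketch in plain Python (no Elasticsearch calls) of folding events into one summary document per user. The field names `has_submitted`, `last_action` and `last_timestamp` and the sample events are assumptions made up for illustration; they are not taken from the video's example code.

```python
from datetime import datetime

# Hypothetical events, shaped like the samples in the question but with
# distinct timestamps so that "last" is well defined.
events = [
    {"action": "answered_question", "user": 1, "timestamp": "2016-01-01 00:00:01"},
    {"action": "pressed_submit", "user": 1, "timestamp": "2016-01-01 00:00:02"},
    {"action": "answered_question", "user": 2, "timestamp": "2016-01-01 00:00:03"},
    {"action": "filled_textbox", "user": 2, "timestamp": "2016-01-01 00:00:04"},
]


def parse_ts(value):
    """Parse the timestamp format used in the sample documents."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def build_entities(events):
    """Fold events into one summary document per user, keyed by user id."""
    entities = {}
    for event in events:
        doc = entities.setdefault(event["user"], {
            "user": event["user"],
            "has_submitted": False,
            "last_action": None,
            "last_timestamp": None,
        })
        if event["action"] == "pressed_submit":
            doc["has_submitted"] = True
        # Keep the most recent action seen so far for this user.
        if (doc["last_timestamp"] is None
                or parse_ts(event["timestamp"]) >= parse_ts(doc["last_timestamp"])):
            doc["last_action"] = event["action"]
            doc["last_timestamp"] = event["timestamp"]
    return entities


entities = build_entities(events)

# The original question becomes a simple filter over the entity documents.
not_submitted = {
    uid: doc["last_action"]
    for uid, doc in entities.items()
    if not doc["has_submitted"]
}
print(not_submitted)  # {2: 'filled_textbox'}
```

Once each user has a document like this, "last action of users who never submitted" is a plain filter on `has_submitted` rather than a comparison across many event documents.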

Is there any tutorial on how to create entity-centric indexes?

The video I shared along with the example code.

Yes, but the example implies having to re-index data. Unfortunately I do not control the application that writes to the ES index: it is proprietary third-party software.

The example advocates keeping your existing event-centric index and building a secondary entity-centric index from it.

But it revolves around the idea of having to recreate the "reviewers" index, or update it, with a cron job or every time I need a report. So, whenever I need to run my query, I have to make sure that buildEntities.sh was run recently. Correct?

That is, effectively, slower than processing the data in my programming language of choice.

At 31:30 in the video I talk about incremental updates.
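As a rough illustration of the incremental idea (this is not the code from the video, and every name here is hypothetical), each run could remember the newest timestamp it has processed and fold only later events into the existing entity documents. The sketch is plain Python and assumes timestamps in the `YYYY-MM-DD HH:MM:SS` format from the question, which sort correctly as strings.

```python
def apply_new_events(entities, events, checkpoint=""):
    """Merge events newer than `checkpoint` into `entities`.

    Returns the new checkpoint (the newest timestamp seen so far).
    """
    newest = checkpoint
    for event in events:
        ts = event["timestamp"]
        if ts <= checkpoint:
            continue  # already folded in by an earlier run
        doc = entities.setdefault(event["user"], {
            "has_submitted": False,
            "last_action": None,
            "last_timestamp": "",
        })
        if event["action"] == "pressed_submit":
            doc["has_submitted"] = True
        if ts >= doc["last_timestamp"]:
            doc["last_action"] = event["action"]
            doc["last_timestamp"] = ts
        newest = max(newest, ts)
    return newest


entities = {}
first_batch = [
    {"action": "answered_question", "user": 2, "timestamp": "2016-01-01 00:00:01"},
    {"action": "filled_textbox", "user": 2, "timestamp": "2016-01-01 00:00:02"},
]
checkpoint = apply_new_events(entities, first_batch)

# A later run sees the old events again plus one new one; only the new
# event changes the entity document.
second_batch = first_batch + [
    {"action": "pressed_submit", "user": 2, "timestamp": "2016-01-01 00:00:03"},
]
checkpoint = apply_new_events(entities, second_batch, checkpoint)
print(entities[2]["has_submitted"], checkpoint)
```

In a real setup the second run would presumably fetch only events newer than the checkpoint, for example with a range query on the timestamp field, instead of re-reading everything. Note that an event arriving late with a timestamp equal to the checkpoint would be skipped by this sketch.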
