Significant Terms query with conditions in filter context

Is it possible to use a significant terms query to answer questions that require applying conditions in the filter context?

I'm trying to answer typical questions such as:

"For people with event A, find unusual events which come BEFORE A"

Significant terms aggregation is designed to diff 2 sets of documents. What you use to identify those documents is your choice.

The "For people with event A" question sets off alarm bells for me. It is a question about a type of entity ("people") and their behaviours. This sort of analysis typically requires an entity-centric index where each doc represents a summary of an entity's behaviour. However, most folks use Elasticsearch to log data in an event-centric index. You may need to create an entity-centric index from your event-centric index to do this behavioural analysis.
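
For reference, the "diff 2 sets of documents" idea can be sketched as a search body in which the query picks the foreground set and the rest of the index acts as the background. This is only a minimal sketch: the field name `events.name`, the aggregation name `unusual_events`, and the helper function itself are assumptions for illustration, not something taken from this thread.

```python
def significant_events_query(event, field="events.name", size=10):
    """Build a search body that diffs docs containing `event` against the index.

    `field` is a hypothetical keyword field holding event names.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        # The query defines the foreground set: docs that contain the event.
        "query": {"term": {field: event}},
        "aggs": {
            "unusual_events": {
                # By default the background set is every doc in the index.
                "significant_terms": {"field": field, "size": size}
            }
        },
    }


body = significant_events_query("X")
```

The returned dict is the JSON body for a `_search` request against an entity-centric index.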

Hi Mark,
Many thanks for your response. I actually saw your video, and I already have two indexes, one event-centric and the other entity-centric:
'events': each doc is an event
'clients': each doc is a client with all its events + stats

I'm currently using the 'clients' index to answer questions such as:

For a subset of 'clients' having event X, find unusual events

This is working very nicely, but I wish to improve it to:

For a subset of 'clients' having event X, find unusual events which occurred before X

The events in the 'clients' index are stored in an array of objects.
The question is how to find only the unusual events that occur BEFORE the event used in the filter query (either by position in the array, or by comparing a timestamp field).

So if I understand correctly, it's a "what events are a precursor to event X?" type of question: finding what might be the root cause of a problem (remembering that correlation is not causation :) ).
This will be a trickier one. If "event X" is always the last thing a client did, you could also keep a field called "penultimateEvent" and look for significant terms in penultimateEvent where lastEvent is equal to X.
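
A rough sketch of that penultimateEvent request, assuming both `lastEvent` and `penultimateEvent` are indexed as keyword fields on each client doc (the helper function and the aggregation name `precursors` are made up for illustration):

```python
def precursor_query(event_x, size=10):
    """Build a search body for significant penultimate events before event_x."""
    return {
        "size": 0,  # aggregation only
        # Foreground set: clients whose last event was event_x.
        "query": {"term": {"lastEvent": event_x}},
        "aggs": {
            "precursors": {
                # Terms in penultimateEvent that are unusually common
                # in the foreground set compared with the whole index.
                "significant_terms": {"field": "penultimateEvent", "size": size}
            }
        },
    }


body = precursor_query("X")
```

This only covers the case where event X is the final entry in a client's history.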
However, if event X could occur anywhere in an array representing event histories, then life gets more complex. In theory you could use a Painless script to generate the foreground stats for precursor events: it would just return all terms immediately prior to event X in the array of values.
However, the background stats would be expensive to compute using a script which is why we disallow scripted value sources for significant terms.

Maybe in your case you might have to create a special form of entity index where the array of terms is truncated at index time to just the values leading up to event X, for those clients that have experienced event X.
However, if you're lucky, you don't need to truncate at all: any post-eventX terms are tuned out by significant_terms as commonly common, and it can tune into the precursor events just fine. It all depends on the signal strength in your data.
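
The index-time truncation could look roughly like the sketch below. It assumes each client's events are already available as a chronologically ordered list of event names. The field names `precursorEvents` and `hadEventX` are hypothetical; the flag is there because event X itself is cut from the array, so something else has to identify the foreground set.

```python
def build_entity_doc(client_id, events, event_x):
    """Build a client doc whose event array stops just before event_x.

    `events` is assumed to be sorted oldest-first. Clients that never
    experienced event_x keep their full history.
    """
    had_event = event_x in events
    if had_event:
        # Keep only what happened before the first occurrence of event_x.
        kept = events[: events.index(event_x)]
    else:
        kept = list(events)
    return {
        "client_id": client_id,
        "hadEventX": had_event,      # hypothetical flag used to filter on
        "precursorEvents": kept,     # hypothetical truncated history field
    }


doc = build_entity_doc("c1", ["login", "error", "X", "logout"], "X")
```

With docs shaped like this, the filter would match on `hadEventX` and significant_terms would run over `precursorEvents`, so post-eventX terms never reach the foreground stats.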

Hi, thanks.
Currently the post-eventX terms appear very often in the results of the significant_terms query.
We do, however, usually see the cause of the problem somewhere among them, so some post-processing of the query result could probably solve this.
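
One possible shape for that post-processing, as a sketch only: fetch the event histories of the matching clients separately, collect every term that occurs before event X in at least one of them, and drop the buckets whose key is not in that set. The function names and the assumption that histories are chronologically ordered lists of event names are mine; `key` is the term field on each significant_terms bucket.

```python
def precursor_terms(histories, event_x):
    """Collect every term seen before the first event_x in any history."""
    seen = set()
    for events in histories:
        if event_x in events:
            seen.update(events[: events.index(event_x)])
    return seen


def filter_buckets(buckets, histories, event_x):
    """Keep only buckets whose term occurs before event_x somewhere."""
    allowed = precursor_terms(histories, event_x)
    return [b for b in buckets if b["key"] in allowed]


buckets = [{"key": "timeout", "score": 0.9}, {"key": "refund", "score": 0.7}]
histories = [["login", "timeout", "X", "refund"]]
kept = filter_buckets(buckets, histories, "X")
```

This is a coarse filter: a term that precedes X in a single client passes even if it mostly shows up afterwards, so a minimum count or ratio may be needed in practice.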
