Elasticsearch sequence pattern mining

I am looking for a way to search for patterns in Elasticsearch events.

Let us consider two different queries, one matching documents of type A and the other matching documents of type B.

Is it possible to obtain all documents where an A is followed by a B which is followed by another A? The order is based on timestamps.

From another perspective, I am looking for a way to compare the timestamp field across multiple documents.

You can't do that without building another view of your data, i.e. "entity-centric indexing".

To do this you'd build a document that represents all of an entity's actions. For example, you'd convert an index where each document represents a single click into a new index where each document represents a whole user session, with each click stored in the document as a nested object.
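As a rough illustration of that conversion, here is a minimal Python sketch that groups event-centric records into one document per session. The field names (`session_id`, `type`, `timestamp`, `clicks`) are assumptions made up for this example, and the sketch only builds plain dictionaries; writing them to a new index is left out.

```python
from collections import defaultdict


def build_session_docs(events):
    """Group one-document-per-click events into one document per session."""
    grouped = defaultdict(list)
    for event in events:
        # "session_id" is a hypothetical join key for this example.
        grouped[event["session_id"]].append(event)

    docs = {}
    for session_id, session_events in grouped.items():
        # Sort by timestamp so the order of clicks is explicit in the document.
        ordered = sorted(session_events, key=lambda e: e["timestamp"])
        docs[session_id] = {
            "session_id": session_id,
            # Each click becomes one entry in a list (a nested object in the index).
            "clicks": [
                {"type": e["type"], "timestamp": e["timestamp"]} for e in ordered
            ],
        }
    return docs


events = [
    {"session_id": "s1", "type": "A", "timestamp": 1},
    {"session_id": "s2", "type": "B", "timestamp": 2},
    {"session_id": "s1", "type": "A", "timestamp": 5},
    {"session_id": "s1", "type": "B", "timestamp": 3},
]
session_docs = build_session_docs(events)
```

With the events in one document and in timestamp order, a question like "A followed by B followed by A" becomes a question about a single document rather than a comparison across documents.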

Take a look at https://www.elastic.co/videos/entity-centric-indexing-mark-harwood for more details.

Hey, thanks for the entity-centric indexing technique. It is almost what I was looking for.

However, if the query is really large and complex (like A->B->......->F->C->N, with a size of around 1500), would it be effective to create entity-centric indexes? Also, is it efficient/useful to create the entity-centric indexes at query time (i.e. not periodically, as Mark does in the video)?

Generally, the issue entity-centric indexing tackles is joining related data, and it does so by shifting the costs involved from query time to index time.
If the key you join the data on has many unique values, or the business logic in any derived properties is complex [1], then you will generally need to take this approach to avoid overly long or complex queries.

It would certainly be simpler to search for an indexed token that was ABFCN...
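To make the token idea concrete, here is a small Python sketch that assumes each event type can be reduced to a single character. It builds the sequence string for one entity and checks it for a pattern in two ways: as a contiguous run, and as an in-order match that allows other events in between. The function names are invented for this example; how the token is indexed and queried in Elasticsearch is not shown.

```python
def sequence_token(ordered_types):
    """Concatenate single-character event types, already sorted by timestamp."""
    return "".join(ordered_types)


def contains_contiguous(token, pattern):
    """True if the pattern occurs with no other events in between."""
    return pattern in token


def contains_in_order(token, pattern):
    """True if the pattern's events occur in order, gaps allowed."""
    remaining = iter(token)
    # Each membership test consumes the iterator up to the matched character,
    # so later pattern characters can only match later positions in the token.
    return all(ch in remaining for ch in pattern)


token = sequence_token(["A", "C", "B", "A"])
strict_match = contains_contiguous(token, "ABA")
loose_match = contains_in_order(token, "ABA")
```

Here `strict_match` is `False` because a C sits between the first A and the B, while `loose_match` is `True` because A, B and A still appear in that order.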

When I said "periodically" I did not necessarily mean overnight consolidation jobs. The update job could be run every second to patch in just the latest events. Think about your browser loading this web page: it is a flurry of activity involving many individual requests to get HTML, CSS, JavaScript, images etc. I wouldn't rush to update your entity-centric web session document upon receipt of every individual log record containing your session cookie. I could hang back just a second and perform only one update to your session doc with a batch of maybe 20 log file entries pertaining to your latest activity. This would save 19 Lucene updates. I think of it more as "micro-batching". Of course, the sensible duration of a batch will depend on the nature of your system.
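A minimal Python sketch of that micro-batching idea, under some assumptions: events are buffered in memory per session, and a caller-supplied `apply_update` callback (hypothetical, standing in for whatever actually updates the session document) is invoked once per session on each flush. Scheduling the flush, for example once a second, is left to the caller.

```python
from collections import defaultdict


class MicroBatcher:
    """Buffer events per session and apply one update per session per flush."""

    def __init__(self, apply_update):
        # apply_update is a hypothetical callback: (session_id, events) -> None
        self._apply_update = apply_update
        self._pending = defaultdict(list)

    def add(self, event):
        self._pending[event["session_id"]].append(event)

    def flush(self):
        """Send each session's buffered events as one update; return the update count."""
        pending, self._pending = self._pending, defaultdict(list)
        for session_id, events in pending.items():
            ordered = sorted(events, key=lambda e: e["timestamp"])
            self._apply_update(session_id, ordered)
        return len(pending)


updates = []
batcher = MicroBatcher(lambda session_id, events: updates.append((session_id, events)))

# 20 log entries arrive for the same session within one batch window.
for i in range(20):
    batcher.add({"session_id": "s1", "type": "request", "timestamp": i})

update_count = batcher.flush()
```

In this run `update_count` is 1: the 20 buffered entries reach the session document in a single update rather than 20 separate ones.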

Cheers
Mark

[1] For a car, distance-driven-while-failed is a property derived from the difference between the mileage reported on the car's first failed test result and the mileage reported on the test results that follow, up to and including a subsequent "pass" test result.
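One possible reading of that derived property, sketched in Python: for each run of failures, take the mileage at the first failure and subtract it from the mileage at the next pass, then sum over all such runs. The field names (`date`, `outcome`, `mileage`) and the choice to ignore a failure that has no later pass are assumptions for this example.

```python
def distance_driven_while_failed(test_results):
    """Sum the mileage driven between each first failure and the next pass."""
    total = 0
    first_fail_mileage = None
    # ISO-formatted date strings sort chronologically.
    for result in sorted(test_results, key=lambda r: r["date"]):
        if result["outcome"] == "fail":
            if first_fail_mileage is None:
                # Remember only the first failure of each failed stretch.
                first_fail_mileage = result["mileage"]
        elif result["outcome"] == "pass" and first_fail_mileage is not None:
            total += result["mileage"] - first_fail_mileage
            first_fail_mileage = None
    # A failure with no later pass contributes nothing (an assumption).
    return total


car_tests = [
    {"date": "2015-03-01", "outcome": "pass", "mileage": 10000},
    {"date": "2016-03-01", "outcome": "fail", "mileage": 20000},
    {"date": "2016-03-20", "outcome": "fail", "mileage": 21000},
    {"date": "2016-04-02", "outcome": "pass", "mileage": 21500},
]
failed_distance = distance_driven_while_failed(car_tests)
```

Logic like this is awkward to express at query time across separate test-result documents, which is the kind of case where computing it once per car at index time tends to pay off.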
