Elasticsearch sequence pattern mining

vinay_khandelwal · June 5, 2017, 10:24am

I am looking for a way to search for patterns in events stored in Elasticsearch.

Let us consider two different query matches that return documents of type A and B.

Is it possible to obtain all documents where an A is followed by a B which is followed by another A? The order is based on timestamps.

From another perspective, I am looking for a way to compare the timestamp field across multiple documents.

Clinton_Gormley · June 9, 2017, 11:37am

You can't do that without building another view of your data, i.e. "entity-centric indexing".

To do this you'd build a document that represents all of an entity's actions. For example, you'd convert an index where each document represents a single click into a new index where each document represents a whole user session, with each click stored in the document as a nested object.
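As a rough sketch of that conversion, the grouping step might look like the following in plain Python. The field names `session_id`, `timestamp`, and `type` are assumptions for illustration, not taken from any real index, and writing the resulting documents into Elasticsearch is left out.

```python
from collections import defaultdict


def build_session_docs(events):
    """Group flat event documents into one document per session.

    Each event is a dict with the hypothetical fields
    `session_id`, `timestamp`, and `type`.
    """
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session_id"]].append(event)

    docs = []
    for session_id, session_events in sessions.items():
        # Order each session's events by timestamp before nesting them.
        ordered = sorted(session_events, key=lambda e: e["timestamp"])
        docs.append({
            "session_id": session_id,
            "events": [
                {"timestamp": e["timestamp"], "type": e["type"]}
                for e in ordered
            ],
        })
    return docs


# Hypothetical event-centric input: one dict per click.
events = [
    {"session_id": "s1", "timestamp": 3, "type": "A"},
    {"session_id": "s1", "timestamp": 1, "type": "A"},
    {"session_id": "s2", "timestamp": 1, "type": "B"},
    {"session_id": "s1", "timestamp": 2, "type": "B"},
]
docs = build_session_docs(events)
```

Each resulting document carries the whole ordered history of one session, which is what makes order-based questions answerable from a single document.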

Take a look at https://www.elastic.co/videos/entity-centric-indexing-mark-harwood for more.

vinay_khandelwal · June 14, 2017, 9:20am

Hey, thanks for the entity-centric indexing technique. It is almost what I was looking for.

However, if the query is really large and complex (like A->B->......->F->C->N) [a sequence length of around 1500], would it be effective to create entity-centric indexes? Also, is it efficient/useful to create the entity-centric indexes at query time (i.e. not periodically, as Mark does in the video)?

Mark_Harwood · June 14, 2017, 10:19am

Generally, the issue entity-centric indexing tackles is joining related data, and it does so by shifting the costs involved from query time to index time.
If the key you join the data on has many unique values, or the business logic in any derived properties is complex [1], then generally you will need to do this to avoid overly long or complex queries.

It would certainly be simpler to search for an indexed token that was ABFCN...
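A minimal sketch of that idea, assuming each event type can be abbreviated to a single character and that "followed by" means immediately followed by: the entity document gets one derived string field, and the pattern becomes a substring check. The function names are hypothetical, and how such a field would be mapped and queried in Elasticsearch is not shown here.

```python
def sequence_token(events):
    """Concatenate event types in timestamp order into one token.

    Assumes every event has a one-character `type` (a simplification
    made for this example).
    """
    ordered = sorted(events, key=lambda e: e["timestamp"])
    return "".join(e["type"] for e in ordered)


def contains_pattern(token, pattern):
    """True if the events in `pattern` occur back to back in `token`."""
    return pattern in token


session_events = [
    {"timestamp": 30, "type": "A"},
    {"timestamp": 10, "type": "A"},
    {"timestamp": 20, "type": "B"},
]
token = sequence_token(session_events)
matches = contains_pattern(token, "ABA")
```

If other events are allowed in between the A, B, and A, a substring check is too strict and a subsequence or regular-expression match over the token would be needed instead.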

When I said "periodically" I did not necessarily mean overnight consolidation jobs. The update job could be run every second to patch in just the latest events. Think about your browser loading this web page: it is a flurry of activity involving many individual requests to get HTML, CSS, JavaScript, images etc. I wouldn't rush to update your entity-centric web session document upon receipt of every individual log record containing your session cookie. I could hang back just a second and perform only one update to your session doc with a batch of maybe 20 log file entries pertaining to your latest activity. This would save 19 Lucene updates. I think of it more as "micro-batching". Of course, the sensible duration of a batch will depend on the nature of your system.
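The micro-batching idea can be sketched roughly as below. This is only an illustration under assumptions: `MicroBatcher` and `apply_update` are hypothetical names, the clock is passed in explicitly to keep the example deterministic, and the callback stands in for whatever actually updates the session document.

```python
from collections import defaultdict


class MicroBatcher:
    """Buffer events per session and apply them in one update per flush."""

    def __init__(self, apply_update, interval_seconds=1.0):
        self.apply_update = apply_update  # hypothetical callback: (session_id, events)
        self.interval = interval_seconds
        self.pending = defaultdict(list)
        self.last_flush = None

    def add(self, session_id, event, now):
        """Queue an event; flush if the batch interval has elapsed."""
        if self.last_flush is None:
            self.last_flush = now
        self.pending[session_id].append(event)
        if now - self.last_flush >= self.interval:
            self.flush(now)

    def flush(self, now):
        """Send one update per session that has pending events."""
        for session_id, events in self.pending.items():
            self.apply_update(session_id, events)
        self.pending.clear()
        self.last_flush = now


updates = []
batcher = MicroBatcher(lambda sid, evs: updates.append((sid, len(evs))))

# Twenty log entries for one session arrive within the same second.
for i in range(20):
    batcher.add("s1", {"n": i}, now=i * 0.05)

batcher.flush(now=1.0)  # a single update carries all twenty entries
```

Here twenty incoming log entries result in one call to the update callback rather than twenty, which is the saving described above.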

Cheers
Mark

[1] For a car, the distance-driven-while-failed is a property derived from the difference between the mileage reported on the car's first failed test result and the mileage on the test results that follow, up to and including a subsequent "pass" test result.
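One possible reading of that derived property, as a sketch: for every run of failures, take the mileage at the next pass minus the mileage at the first failure of the run, and sum those differences. The field names `date`, `mileage`, and `outcome` are assumptions, and a failure with no later pass is simply ignored here.

```python
def distance_driven_while_failed(test_results):
    """Sum the mileage driven between a first failure and the next pass.

    Each result is a dict with the hypothetical fields `date`
    (ISO string), `mileage`, and `outcome` ("pass" or "fail").
    """
    ordered = sorted(test_results, key=lambda r: r["date"])
    total = 0
    failed_since = None  # mileage at the first failure of the current run
    for result in ordered:
        if result["outcome"] == "fail":
            if failed_since is None:
                failed_since = result["mileage"]
        elif failed_since is not None:
            # A pass closes the current run of failures.
            total += result["mileage"] - failed_since
            failed_since = None
    return total


results = [
    {"date": "2015-01-10", "mileage": 10000, "outcome": "pass"},
    {"date": "2016-01-12", "mileage": 12000, "outcome": "fail"},
    {"date": "2016-02-01", "mileage": 12500, "outcome": "fail"},
    {"date": "2016-03-05", "mileage": 13000, "outcome": "pass"},
    {"date": "2017-01-15", "mileage": 20000, "outcome": "fail"},
]
distance = distance_driven_while_failed(results)
```

Logic like this needs the car's full ordered test history in one place, which is why it is easier to compute when maintaining an entity-centric document than at query time.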

