Can Elasticsearch help with this advanced use case?

Hi there,

I have been asked whether Elasticsearch can help solve the following task.

We want to analyze our users' workflows, that is, to check in which order users call our functions within a session, and we would like to predict what users will do next. Let's look at an example. Our logs provide information about some user workflows:

session 1: func1, func2, func16, func18, func46, func77, func33
session 2: func2, func16, func18, func46, func1, func12
session 3: func11, func16, func48, func88
session 4: func1, func2, func16, func18, func46, func77, func32
session 6: func1, func2, func46, func18, func16, func77, func33

Above we see 5 sessions and the order of their function calls.
I would like to be able to search for all workflows that called func16, then func18, then func46 (in this order). I also want to get back what the next function call may be, ordered by the number of occurrences in my data lake.

So users 1, 2 and 4 represent valid matches, because they called the searched functions in the given order.
As a result I would like to get the next function call, ordered by hits.
So I would like to get:
hits: 2, next workflow item func77 (by users 1 and 4)
hits: 1, next workflow item func1 (by user 2)
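
To make the expected result concrete, here is a minimal plain-Python sketch (no Elasticsearch involved) that computes it for the five sample sessions. The function name `next_call_counts` is hypothetical, and the sketch assumes the searched calls must appear directly one after another.

```python
from collections import Counter

# The five sample sessions from above, keyed by their label in the logs.
SESSIONS = {
    "session 1": ["func1", "func2", "func16", "func18", "func46", "func77", "func33"],
    "session 2": ["func2", "func16", "func18", "func46", "func1", "func12"],
    "session 3": ["func11", "func16", "func48", "func88"],
    "session 4": ["func1", "func2", "func16", "func18", "func46", "func77", "func32"],
    "session 6": ["func1", "func2", "func46", "func18", "func16", "func77", "func33"],
}


def next_call_counts(sessions, pattern):
    """Count which call directly follows each contiguous occurrence of `pattern`.

    Hypothetical helper: every occurrence is counted, so a session that
    repeats the pattern contributes more than one hit.
    """
    counts = Counter()
    n = len(pattern)
    for calls in sessions.values():
        # Stop early enough that calls[i + n] (the following call) exists.
        for i in range(len(calls) - n):
            if calls[i:i + n] == pattern:
                counts[calls[i + n]] += 1
    # most_common() sorts by hit count, highest first.
    return counts.most_common()


result = next_call_counts(SESSIONS, ["func16", "func18", "func46"])
print(result)
```

For the sample data this yields func77 with 2 hits and func1 with 1 hit, which is the ranking asked for above.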

All function calls arrive as single log events. Each workflow can be identified by a session ID.
What options do I have to achieve this?

I have the following ideas in mind, but they might not work or may have a lot of shortcomings.

#1: I could create parent/child relationships, where each function call has the previous function call as its parent. But is it possible to search for a chain of parent/child relations in Elasticsearch with a single query? And how expensive would that be?

#2: I could create a long string field in a workflow event. With each function call I would append the new function call to the string.

  • Then I could search for a substring. I assume I need some analyzer tweaking (I have not worked with analyzers yet) to make it reasonably performant. Are there any analyzers that respect the order of terms? It is also possible that a function is called multiple times, so I also need to distinguish between func1, func2, func1 and func2, func1, func1. Can analyzers help here? I assume a wildcard search like *func16, func18, func46* will be really expensive.
  • Could I also add some fuzzy matching to the search, so that I still get results when a single function call differs or is missing?

I would be glad if someone could tell me whether Elasticsearch is the right tool to answer this question. A pointer to #1, #2 or #3 (you have it :wink:) would be really appreciated.

Thanks a lot, Andreas

Hi,

May I suggest looking at the ngram tokenizer to index your data:

session 1: func1, func2, func16, func18, func46, func77, func33

It may help you search without wildcards, especially leading wildcards.

Thanks, I will check this out and will also dig deeper into the other tokenizers. Good hint.

But what about the order in which the tokens occur?
How do I search for words / tokens in a specific order in a field? Can Elasticsearch also help here?

Hi,
Sorry, I think I sent you down the wrong path: :bowing_man:

According to the docs, and after a couple of tests, using the ngram tokenizer out of the box will not help. However, you can prepare your data in the same way yourself and leave out the noise.

example:

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 5,
    "max_gram": 12,
    "token_chars": [
      "letter",
      "digit",
      "punctuation",
      "whitespace"
    ]
  },
  "text": "func1, func2, func16"
}

It returns a warning:

! Deprecation: Deprecated big difference between max_gram and min_gram in NGram Tokenizer

and the following results:

  "tokens" : [
    {
      "token" : "func1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
....
    {
      "token" : "func1, func",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "func1, func2",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 7
    },
    {.....

It returns more than 600 lines, but as you can see you get tokens such as "func1" and "func1, func2" that you can search on. You still need to remove the noise in between, such as "func1, func", which is of no use in your case.
I see no other choice than building your strings before sending them to Elasticsearch, as you describe in #2.
I also looked into ingest but found nothing useful there.

If you can send Elasticsearch something like this:

one log:

session 1: func1, func2, func16, func18, func46, func77, func33

one document:

{
  "user": "session 1",
  "sessions": [
    "func1",
    "func1, func2",
    "func1, func2, func16",
    "...",
    "func2, func16",
    "func2, func16, func18",
    "...",
    "func16, func18, func46",
    "..."
  ]
}

This way you can search for and find func16, func18, func46 with a simple term search.

It solves this problem:

So users 1, 2 and 4 represent valid matches, because they called the searched functions in the given order.

For this problem:

As a result I would like to get the next function call, ordered by hits.

You need to store your documents in another way and search starting from your first result... but combining both will be hard.
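
A minimal sketch of how such a document could be built before indexing, in plain Python. The helper name `contiguous_subsequences` is hypothetical, and the term query at the end assumes that "sessions" is mapped as a keyword field, so that each string is indexed as one exact term.

```python
def contiguous_subsequences(calls, separator=", "):
    """Return every contiguous run of calls, joined into one string per run."""
    runs = []
    for start in range(len(calls)):
        for end in range(start + 1, len(calls) + 1):
            runs.append(separator.join(calls[start:end]))
    return runs


calls = ["func1", "func2", "func16", "func18", "func46", "func77", "func33"]

# One document per session, shaped like the example document above.
document = {"user": "session 1", "sessions": contiguous_subsequences(calls)}

# Request body for a term search on the exact sequence.
# Assumption: "sessions" is a keyword field (not an analyzed text field).
query = {"query": {"term": {"sessions": "func16, func18, func46"}}}

print(len(document["sessions"]))
```

A session with n calls produces n(n+1)/2 strings, so the array grows quadratically with session length. For long sessions it may be worth capping the maximum run length.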

Or maybe you already found #3 :grinning:

Hi,

that idea sounds very interesting:

I don't know how fast it would be, but I think this could be done by Logstash. If I use document IDs like ${session_id}, I could use the elasticsearch filter plugin in Logstash to read the previous document, if it exists. If so, I could append the current function to the existing strings and create new ones.
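
The append step could look roughly like the following. This is plain Python that only illustrates the logic, not Logstash configuration, and it assumes the stored document carries an extra, hypothetical `open_runs` field holding the strings that end at the most recent call.

```python
def append_call(state, new_call, separator=", "):
    """Extend a session's stored sub-sequences with one new function call.

    `state["sessions"]` holds all strings built so far, and
    `state["open_runs"]` (hypothetical field) holds those ending at the
    latest call. Only the open runs need to be extended.
    """
    open_runs = [run + separator + new_call for run in state["open_runs"]]
    # The new call on its own also starts a fresh run.
    open_runs.append(new_call)
    return {
        "sessions": state["sessions"] + open_runs,
        "open_runs": open_runs,
    }


# Simulate three log events arriving one by one for the same session.
state = {"sessions": [], "open_runs": []}
for call in ["func1", "func2", "func16"]:
    state = append_call(state, call)

print(state["sessions"])
```

Keeping the open runs separately means each new event only touches as many strings as there are calls in the session so far, instead of rebuilding the whole array.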

Then I could test what will be faster, wildcard search vs. full string in sessions array.
Thanks a lot, I will discuss it with my team.

To find the next function, I could offload the work to an application that reads the full string containing all function calls in order. As a first step, the application could search for the next entry there. Later I could maybe also build a structure that stores pairs of "concat_functions" and "next_call".

I think it may be a good starting point.
Regards, Andreas

Then I could test what will be faster, wildcard search vs. full string in sessions array.

The term search will definitely be faster. Also keep an eye on the heap: you can crash your server with a query that is too heavy (I have already done it with too many aggregations :sweat_smile:).

For the next function, I think the stored pair concat_functions | next_call is the better way to go, as it can be really fast to search.
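
As a sketch of that idea, the pairs could be generated like this in plain Python. The helper name `prefix_next_pairs` and the field names `session_id`, `concat_functions` and `next_call` are assumptions, as is the mapping of both string fields as keyword fields.

```python
def prefix_next_pairs(session_id, calls, separator=", "):
    """Build one small document per (contiguous run, following call) pair."""
    docs = []
    for start in range(len(calls)):
        # `end` indexes the call that follows the run calls[start:end].
        for end in range(start + 1, len(calls)):
            docs.append({
                "session_id": session_id,
                "concat_functions": separator.join(calls[start:end]),
                "next_call": calls[end],
            })
    return docs


calls = ["func1", "func2", "func16", "func18", "func46", "func77", "func33"]
docs = prefix_next_pairs("session 1", calls)

# Request body: filter on the exact run, then count the following calls.
# A terms aggregation returns buckets ordered by document count, highest
# first, by default. Assumes both fields are keyword fields.
search_body = {
    "size": 0,
    "query": {"term": {"concat_functions": "func16, func18, func46"}},
    "aggs": {"next_calls": {"terms": {"field": "next_call"}}},
}

matches = [d["next_call"] for d in docs
           if d["concat_functions"] == "func16, func18, func46"]
print(matches)
```

With this layout the ranking of next calls comes back from a single request, at the cost of indexing many small documents per session.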