Grouping logs into sessions

My entries in Elasticsearch are logs of different events. I am trying to group the logs into user sessions based on an attribute of the logs. Each log has an action attribute; every time the action is "session_start" it means a new session was started, and all the logs until the next "session_start" action belong to that session. What I want to do is create some visualizations based on the sessions.

My first question is: how can I split the logs into sessions which I can later analyze? I was trying to add an id to all entries which belong to one session, but I am having trouble doing that using Painless, as I cannot increment the id value per iteration.

After that, depending of course on how I did the split into sessions, is there some tutorial that can walk me through creating visualizations?

Can you provide some sample log messages, show how you are getting them now and what you expect?

Also, how are you indexing this data? Using Filebeat? Logstash? Another tool?

Do you have any unique identifier on the logs that are part of the same session?

For simplicity, for now I am just uploading the logs using a CSV file.

There is a log_id which can be used as a unique identifier.

Here are some sample log entries:

"_id": "1234566788ab214",
        "_score": 1,
        "_source": {
          "log_id": "1234566788ab214",
          "policy_id": "registration",
          "time_lo_res": "1697182080000",
          "@timestamp": "2023-10-13T07:28:27.545Z",
          "action": "session_start",
          "time": "1697182107545"
        },

 "_id": "1234566788ab217",
        "_score": 1,
        "_source": {
          "log_id": "1234566788ab217",
          "policy_id": "registration",
          "time_lo_res": "1697182080000",
          "@timestamp": "2023-10-13T07:28:27.545Z",
          "action": "function",
          "time": "1697182107745"
        },

 "_id": "1234566788ab227",
        "_score": 1,
        "_source": {
          "log_id": "1234566788ab227",
          "policy_id": "registration",
          "time_lo_res": "1697182080000",
          "@timestamp": "2023-10-13T07:28:27.545Z",
          "action": "form",
          "time": "1697182108545"
        },

 "_id": "1234566788ab237",
        "_score": 1,
        "_source": {
          "log_id": "1234566788ab237",
          "policy_id": "registration",
          "time_lo_res": "1697182080000",
          "@timestamp": "2023-10-13T07:28:27.545Z",
          "action": "session_start",
          "time": "1697182207545"
        }

And what I want is to somehow link the first three logs into one so-called session and the last one into another, so that when creating dashboards I will be able to analyze different values per session, for example whether the session ended or how long it takes to finish the session.

Not in this case; I meant something that you can use to correlate the different events.

For example, the first 3 events have nothing in common that you can use to correlate them and tell that they are part of the same session:

"log_id": "1234566788ab214",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"action": "session_start",
"time": "1697182107545",

"log_id": "1234566788ab217",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"action": "function",
"time": "1697182107745",

"log_id": "1234566788ab227",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"action": "form",
"time": "1697182108545",

You do not have any field that could be used to group the events, like a session_id that would be present in events from the same session and have the same value.

You may be able to aggregate the events that are part of the same session using Logstash with the aggregate filter; this way you would have one event per session.

Can you share the source csv file?

Unfortunately I cannot share it. Indeed, there is no value that can be used to correlate the different sessions. I managed to split the logs into sessions in Python by first sorting on log_id and then iterating through them; every time I see a session_start I split into a new session (it is a bit more complex, because sometimes a session_start does not mean a new session). I am sharing the Python code, maybe it will give more insight:

list_per_session = []  # one entry per session, each a list of log rows

path = []  # rows of the session currently being built

change_jour = False  # True while a session_start should NOT open a new session


# df is a pandas DataFrame of the logs, already sorted by log_id
for i, row in df.iterrows():
    if row.action == "session_start" and (not change_jour):  # end of a journey
        if path:  # add only if not empty
            list_per_session.append(path)
            path = []
    path.append(row)
    if row.action == "session_start" and change_jour:  # the first step of the new journey
        change_jour = False
    if row.action == "assertion_end" and row.action2 == "redirect":
        change_jour = True

if path:  # keep the last session, which the loop itself never appends
    list_per_session.append(path)
    
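For what it's worth, the same splitting logic can be wrapped in a function and sanity-checked on a tiny frame (the column names are taken from my snippet above; the toy data is made up):

```python
import pandas as pd

def split_sessions(df):
    """Split a sorted log DataFrame into per-session lists of rows."""
    sessions, path, change_jour = [], [], False
    for _, row in df.iterrows():
        if row.action == "session_start" and not change_jour:
            if path:  # close the previous session
                sessions.append(path)
                path = []
        path.append(row)
        if row.action == "session_start" and change_jour:
            change_jour = False  # this start continues the current journey
        if row.action == "assertion_end" and row.action2 == "redirect":
            change_jour = True
    if path:  # do not lose the trailing session
        sessions.append(path)
    return sessions

df = pd.DataFrame({
    "action":  ["session_start", "function", "form", "session_start"],
    "action2": ["", "", "", ""],
})
print([len(s) for s in split_sessions(df)])  # two sessions: 3 rows, then 1
```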

I created some dashboards using Python, but I wanted to use some tool which is more dynamic, so I thought about using Elastic, and now I am trying to translate the logic I have in Python to Elastic.

So, you are creating the CSV files using some Python code?

If yes, then you could create a session_id for each session and add it to every event that is part of the same session.

For example, the following code can be used to generate a random id with 12 characters.

import uuid
session_id = uuid.uuid4().hex[0:12]

Then you would need to add this session_id to every event that is part of the same session, so you would end up with something like this:

"log_id": "1234566788ab214",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"action": "session_start",
"session_id": "af443e6410c2",
"time": "1697182107545"

"log_id": "1234566788ab217",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"session_id": "af443e6410c2",
"action": "function",
"time": "1697182107745"

"log_id": "1234566788ab227",
"policy_id": "registration",
"time_lo_res": "1697182080000",
"@timestamp": "2023-10-13T07:28:27.545Z",
"action": "form",
"session_id": "af443e6410c2",
"time": "1697182108545"

Then you could filter on the session_id field in Kibana to see the events of the session.
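If it helps, here is a minimal Python sketch (pandas assumed, toy data made up from the samples above) that assigns such a session_id to every row and then computes a per-session duration, using the simple rule that each session_start opens a new session:

```python
import uuid
import pandas as pd

def add_session_ids(df):
    """Give every row a random 12-char session_id; start a new id at
    each session_start (the simple rule, ignoring the redirect case)."""
    ids, current = [], None
    for action in df["action"]:
        if action == "session_start" or current is None:
            current = uuid.uuid4().hex[:12]
        ids.append(current)
    return df.assign(session_id=ids)

df = pd.DataFrame({
    "log_id": ["1234566788ab214", "1234566788ab217",
               "1234566788ab227", "1234566788ab237"],
    "action": ["session_start", "function", "form", "session_start"],
    "time":   [1697182107545, 1697182107745, 1697182108545, 1697182207545],
})
out = add_session_ids(df)
# duration of each session in milliseconds (max time minus min time)
print(out.groupby("session_id")["time"].agg(lambda t: t.max() - t.min()))
```

With the session_id column indexed alongside the events, the Kibana filtering described above becomes possible.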


Sorry, I was not clear. I am getting the CSV file from a database on a server, and I want to create dashboards for it. I did it using Python, but I want to use Elastic instead. In the future I want to connect the database directly to Elastic and have a live dashboard, but for now I am trying to do it step by step and first see how I can get the dashboards.

Oh, I understood that you were creating the CSV file using the Python code. So you are getting the CSV from a database?

As mentioned you need something to correlate the events of the same session and currently you do not have it.

You need to do that before indexing the data into Elasticsearch. If you use Logstash to ingest your data into Elasticsearch, you may be able to aggregate the events of a session into a single event and work with that later.

But to provide more insight about this I would need a sample of your CSV; you could redact any sensitive values.

Both Logstash and Elasticsearch are event based, and every event is independent of the others. To correlate events you need something in common between them, and your data does not have it, so you need to add it before ingestion.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.