Finding top users moving around vs stationary

Hi,

Let's say we have multiple servers providing a service. Users can connect to any of the servers and get served.
There is a unique ID for each user and each server. We have a log entry whenever a user 'connects' to a particular server.
Based on this, I want to find out:
a) Users who moved around the most, i.e. first connected to server A, then B, maybe back to A (basically any time a user connected to a different server; reconnecting to the same server is not to be counted here).
b) Users who moved around the least (ones that didn't hop around much).

Now I am thinking I could read the data, do some processing in an external client application, and ingest additional bits of information into the 'connection' log that would indicate the 'earlier' server and perhaps whether a change of server occurred.

I would like to know if there is a better way to figure this out directly using a query.

-Thanks
Nikhil

Interesting. Have you tried aggregating the results first on the user ID and then on the cardinality of the server ID? Is the output what you require?

{
  "aggs": {
    "users": {
      "terms": {
        "field": "src.userid.keyword"
      },
      "aggs": {
        "servers": {
          "cardinality": {
            "field": "src.serverid.keyword"
          }
        }
      }
    }
  }
}

Thanks for responding Nachiket.

How can I create a visualization in Kibana for the above query?
Sounds like I need to use Visual Builder.
I am trying to verify the correctness of the query; it does return values that I would have expected to see, though.

I could create the visualization. I was looking for "Cardinality" and didn't notice the friendlier option "Unique Count".
It's not working the way I expected; it is showing a value that is the same as the number of servers. Perhaps making the interval small enough and then adding up the individual values across those small buckets might give me the required answer. Let me play with Visual Builder a bit more.

Hi Nikhil,

Try creating that visualization using a data table. That should give you a more to-the-point answer.

In Visual Builder your values will depend on the interval that you select. It's a little more flexible, but I guess for your requirement the standard data table visualization should suffice.

Nachiket,

A simple data table will not help.
Imagine user A has moved from Server1->Server2->Server3.
Whereas user B has moved from Server1->Server2->Server3->Server1->Server2->Server3.
Now in this case, I would like a count of 2 for user A and a count of 5 for user B (as this user has moved 5 times).
That's why I am thinking of Visual Builder, where I can take unique counts over small intervals and then sum them up. It will not be completely accurate, but I don't care about the actual number, rather just who the top movers are. Thanks.
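
To illustrate what I mean, the idea expressed as a raw aggregation would look something like this (a rough sketch: the @timestamp field name and the one-minute interval are assumptions, and newer Elasticsearch versions use fixed_interval rather than interval in the date_histogram):

{
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "src.userid.keyword"
      },
      "aggs": {
        "per_interval": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "1m"
          },
          "aggs": {
            "servers_in_interval": {
              "cardinality": {
                "field": "src.serverid.keyword"
              }
            }
          }
        },
        "approx_moves": {
          "sum_bucket": {
            "buckets_path": "per_interval>servers_in_interval"
          }
        }
      }
    }
  }
}

The summed approx_moves value is only approximate (it counts distinct servers per interval rather than actual transitions), but as a relative measure it should still surface the heaviest movers.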

Hi Nikhil,

I tried to reproduce this and I was able to create the following table:

[screenshot: resulting data table]

I used the following aggs to create this:
[screenshot: the aggregations used]

I had assumed the data-set to be as follows:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 1,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "7Q7oE2UBpybRgQMxe1Kk",
        "_score": 1,
        "_source": {
          "user": "matt",
          "server": "B"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "7w7oE2UBpybRgQMxllIY",
        "_score": 1,
        "_source": {
          "user": "matt",
          "server": "C"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "6A7nE2UBpybRgQMx71Jd",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "server": "A"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "6g7oE2UBpybRgQMxJlL6",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "server": "C"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "6w7oE2UBpybRgQMxT1JR",
        "_score": 1,
        "_source": {
          "user": "tom",
          "server": "C"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "7A7oE2UBpybRgQMxZVJ4",
        "_score": 1,
        "_source": {
          "user": "tom",
          "server": "B"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "6Q7oE2UBpybRgQMxCFKz",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "server": "B"
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "7g7oE2UBpybRgQMxiFL_",
        "_score": 1,
        "_source": {
          "user": "matt",
          "server": "A"
        }
      }
    ]
  }
}

Nachiket,

As I mentioned, I need to capture the 'movement' of a user across servers.
As per your example, the value will be the same for a user who has visited each server once vs. someone else who did it a hundred times in sequence.

When you initially said movement across servers, I presumed that you were not worried about consecutive hits on the same server.

Do you have something like a unique session ID for each session that the user has? What are the fields present in each document?

Does each doc have only one entry for that specific server?

Is this something similar to what was expected?

Hi Nachiket,

Appreciate your responses.

Yes, you are right that I do not care about consecutive hits on the same server.
There is no unique per-session ID; the user ID is what is unique across all sessions.
I am simplifying the log string; it contains:

  1. A string identifying that a new session was created
  2. The server name where the session is created
  3. The unique user ID.

So to give an example, let's consider two scenarios of user movement:
Scenario A:

  1. User 1-> Server A
  2. User 1-> Server B

Scenario B:

  1. User 1-> Server A
  2. User 1-> Server B
  3. User 1-> Server A
  4. User 1-> Server B

Scenario A: Here I would like to get a count of 1, as the user has moved once (2 is also OK since I don't care about the actual value, only how it compares to other users).

Scenario B: Here I would like to get a count of 3, as the user has moved thrice (4 is also fine).

If I use the query that you gave, for both cases I will get a value of 2 since that is the unique count of servers.

HTH.

Understood. :slight_smile:

I am afraid there is no way to achieve what you are trying to do directly in Kibana (at least not that I am aware of).

It could be possible to create this visualization by using something like the aggregate filter prior to indexing the logs in Elasticsearch. This would help achieve two things:

  1. Remove consecutive duplicate entries
  2. Add a count field to preserve the number of events aggregated.

Have a look here:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html
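
For illustration only, a rough sketch of what such a filter could look like (the userid/serverid field names, the one-hour timeout, and the exact option set are assumptions to be checked against the plugin documentation):

filter {
  aggregate {
    # keep one in-memory map per user
    task_id => "%{userid}"
    code => "
      if map['last_server'] == event.get('serverid')
        # consecutive hit on the same server: drop the event
        event.cancel
      else
        map['last_server'] = event.get('serverid')
        map['moves'] = (map['moves'] || -1) + 1
        event.set('moves_so_far', map['moves'])
      end
    "
    timeout => 3600
  }
}

Each indexed event would then carry a running moves_so_far counter, and a simple max aggregation on that field per user in Kibana would give the total number of moves.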

Having redundant logs that contain no additional information is pointless. Would it be possible to suppress that kind of logging at the application level?

What you're talking about is a form of behavioural analysis on entities (people/sessions) but doing that using a log index. That can be problematic.

Consider creating an entity-centric index.
This video shows why and how.
These scripts and example data provide a starting point.

Your session entity update script would have the following pseudo code:

if newLoggedServer != lastLoggedServer
     numServersUsed++
     lastLoggedServer = newLoggedServer

You could then use Kibana on your sessions index aggregating on the pre-calculated numServersUsed field.
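
For a concrete starting point, that pseudocode as a scripted update might look something like this (a minimal sketch: the user_sessions index name, the document ID, the example server value, and the 7.x-style _update endpoint are assumptions to adapt to your setup):

POST user_sessions/_update/user1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.lastLoggedServer != params.server) { ctx._source.numServersUsed += 1; ctx._source.lastLoggedServer = params.server; }",
    "params": { "server": "B" }
  },
  "upsert": {
    "lastLoggedServer": "B",
    "numServersUsed": 0
  }
}

The upsert creates the entity document on the user's first connection with numServersUsed starting at 0, and each later connection to a different server increments it.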

Quite an insightful talk, @Mark_Harwood. It helped clear up a few questions I had related to a User & Entity based Analytics project that I was working on.

But the 'pay as you go' model that you talked about is effective only if you are aware of the questions that the customer will ask. Do you have any thoughts or ideas about a scenario in which the data is already indexed, similar to what Nikhil is talking about? I have faced this issue quite a few times and have experimented with having some kind of aggregated metrics beforehand, but this always trades off indexing performance with no guarantee that the metrics you have will be of any use.

Thanks Mark, I will go through the links.

FYI, we are developing our own add-on module which we refer to as 'Post Analysis' (not an accurate term, but it's got a nice ring to it :grinning:). Since we have diverse teams who will want to analyze the logs differently and in an ever-evolving manner, we don't want to do too much processing during ingestion. What we have instead is a client-side application that queries the data, does the analysis, and adds the analyzed events back. So it becomes an as-needed operation with the flexibility to do whatever we want inside the Python application.

So for this, I can always use the same method and add 'from' and 'to' information along with a 'change' flag, which I can then simply count.
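
With such a flag in place, the top movers would then fall out of a plain terms aggregation over only the changed events, roughly like this (the server_changed field name is a hypothetical placeholder for whatever our Python application writes back, and the user field is reused from the earlier snippet):

{
  "size": 0,
  "query": {
    "term": {
      "server_changed": true
    }
  },
  "aggs": {
    "top_movers": {
      "terms": {
        "field": "src.userid.keyword",
        "size": 10
      }
    }
  }
}

The terms buckets are ordered by document count by default, so the first buckets are the users who hopped the most; ordering by "_count": "asc" would approximate the least-moving users instead.
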
Thanks.

Sometimes the entity attributes you might choose to store could be ingredients for as-yet-unknown derivations. For example, you could store counters for numSuccessfulLogins and numFailedLogins along with lots of other simple attributes. At query time you could use scripts to derive new ratios from combinations of these attributes, e.g. the login failure ratio for a client device is numFailedLogins / (numSuccessfulLogins + numFailedLogins). The point is that these entity attributes bring data locality and save you query-time scrabbling around in an event-centric index trying to piece together related information about entities that has been scattered across a network. A script deriving new ratios from data held in a single entity document is far more efficient and scalable to process.
Admittedly you still need a clue as to which attributes may make useful ingredients in the future, but the benefits of index-time data fusion still exist.
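
For example, such a query-time script field over an entity index might look like this (a sketch only: the index name is made up, and it assumes both counters are mapped as numerics and present on every document):

GET user_entities/_search
{
  "script_fields": {
    "login_failure_ratio": {
      "script": {
        "lang": "painless",
        "source": "double s = doc['numSuccessfulLogins'].value; double f = doc['numFailedLogins'].value; return (s + f) > 0 ? f / (s + f) : 0.0;"
      }
    }
  }
}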


Makes sense. The approach I advocate isn't to update an entity with every new event that is logged. It's more a "micro batching" approach where periodically you pull and consolidate multiple events in a single update.
This periodic pull could, of course, use a query as a filter and either aggregations or client-side collapsing to thin out the events, which are then applied as a single update to your entity.
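
One way to do that periodic pull is a single filtered search that groups the recent events per user, with the client then collapsing consecutive duplicates before issuing one scripted update per entity. A rough sketch (the index name, @timestamp field, time window, and bucket sizes are all assumptions):

GET connection-logs/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-15m"
      }
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "src.userid.keyword",
        "size": 1000
      },
      "aggs": {
        "recent_connections": {
          "top_hits": {
            "size": 100,
            "sort": [
              { "@timestamp": { "order": "asc" } }
            ],
            "_source": [ "src.serverid" ]
          }
        }
      }
    }
  }
}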
