Get unique counts per minute bucket in Elasticsearch

nitish1402 · December 3, 2019, 12:54pm

Hi Team, I have the following scenario: I receive logs from at most 50k users every 30 seconds for 60 minutes. I am using Elasticsearch to store the data in the following format.

[
  {
    "viewlogId": "9abb5a3a-3678-4459-a425-ccb6f957e317",
    "creationTime": 1575187230000,
    "userId": "USERID_0",
    "viewingSessionId": "2991fa12_viewingSessionId_0_1"
  },
  {
    "viewlogId": "9abb5a3a-3678-4459-a425-ccb6f957e318",
    "creationTime": 1575187230000,
    "userId": "USERID_0",
    "viewingSessionId": "2991fa12_viewingSessionId_0_1"
  },
  {
    "viewlogId": "9abb5a3a-3678-4459-a425-ccb6f957e319",
    "creationTime": 1575187230000,
    "userId": "USERID_0",
    "viewingSessionId": "2991fa12_viewingSessionId_0_1"
  },
  {
    "viewlogId": "9abb5a3a-3678-4459-a425-ccb6f957e320",
    "creationTime": 1575187290000,
    "userId": "USERID_0",
    "viewingSessionId": "2991fa12_viewingSessionId_0_1"
  },
  {
    "viewlogId": "9abb5a3a-3678-4459-a425-ccb6f957e321",
    "creationTime": 1575187290000,
    "userId": "USERID_0",
    "viewingSessionId": "2991fa12_viewingSessionId_0_1"
  }
]

This sample has data for a single user and one session, with viewingSessionId 2991fa12_viewingSessionId_0_1. The viewingSessionId is unique for every user.

Now I am interested in showing a per-minute histogram of unique viewingSessionId values. For that I am using the following query.

GET <<index_name>>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "aggregations": {
    "total_views": {
      "cardinality": {
        "field": "viewingSessionId"
      }
    },
    "date_histogram_1": {
      "date_histogram": {
        "field": "creationTime",
        "fixed_interval": "1m"
      },
      "aggregations": {
        "user_counts": {
          "cardinality": {
            "field": "viewingSessionId"
          }
        }
      }
    }
  }
}
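
To make the expected result concrete, here is a minimal Python sketch that computes the exact per-minute unique viewingSessionId counts for the sample documents above, outside Elasticsearch. It assumes one-minute buckets aligned to the epoch, like the fixed_interval of 1m in the query. The function name unique_sessions_per_minute is hypothetical, and the sample is trimmed to the two fields the aggregation uses.

```python
from collections import defaultdict

MINUTE_MS = 60_000  # one-minute bucket width in milliseconds

# Trimmed copy of the sample documents above (viewlogId and userId omitted).
SESSION_ID = "2991fa12_viewingSessionId_0_1"
SAMPLE_LOGS = [
    {"creationTime": 1575187230000, "viewingSessionId": SESSION_ID},
    {"creationTime": 1575187230000, "viewingSessionId": SESSION_ID},
    {"creationTime": 1575187230000, "viewingSessionId": SESSION_ID},
    {"creationTime": 1575187290000, "viewingSessionId": SESSION_ID},
    {"creationTime": 1575187290000, "viewingSessionId": SESSION_ID},
]


def unique_sessions_per_minute(logs):
    """Return {bucket_start_ms: exact number of distinct viewingSessionId}."""
    sessions_by_bucket = defaultdict(set)
    for log in logs:
        # Round the timestamp down to the start of its minute.
        bucket_start = log["creationTime"] - log["creationTime"] % MINUTE_MS
        sessions_by_bucket[bucket_start].add(log["viewingSessionId"])
    return {bucket: len(ids) for bucket, ids in sorted(sessions_by_bucket.items())}


print(unique_sessions_per_minute(SAMPLE_LOGS))
# → {1575187200000: 1, 1575187260000: 1}
```

The five sample documents fall into two minute buckets, and each bucket contains one unique session, which is the value user_counts should report per bucket.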

But according to the Elastic docs here, and as I have also observed during testing, cardinality counts are approximate, with a maximum precision threshold of 40k. Since I have 50k users and 1-2 view logs per minute, one bucket will have at most 150k records, so the counts will be approximate.

Are there any other approaches to solve this problem, either by changing the index structure or by querying differently?
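
For reference, the threshold mentioned here is the cardinality aggregation's precision_threshold option, which defaults to 3000 and has no additional effect above 40000. The sketch below builds the same request body as the query above with that option set explicitly. The helper name build_request_body is hypothetical, and the empty bool query is left out on the assumption that it matches all documents. This only raises the point below which counts are expected to be close to accurate; it does not make counts above 40000 exact.

```python
import json

# Values above 40000 behave the same as 40000 in Elasticsearch.
MAX_PRECISION_THRESHOLD = 40000


def build_request_body(precision_threshold=MAX_PRECISION_THRESHOLD):
    """Build the search body from the post with an explicit precision_threshold."""
    # Clamp to the largest value that still changes the behaviour.
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    cardinality = {"field": "viewingSessionId", "precision_threshold": threshold}
    return {
        "size": 0,
        "aggregations": {
            # Unique sessions over the whole result set.
            "total_views": {"cardinality": dict(cardinality)},
            # Unique sessions per one-minute bucket.
            "date_histogram_1": {
                "date_histogram": {"field": "creationTime", "fixed_interval": "1m"},
                "aggregations": {"user_counts": {"cardinality": dict(cardinality)}},
            },
        },
    }


# Print the JSON that could be sent as the body of GET <<index_name>>/_search.
print(json.dumps(build_request_body(), indent=2))
```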

Thanks

Elasticsearch version: 7.4.1
