Running cardinality for more than 10000 buckets

Hi, I am working on a project to get the number of unique user IDs per URL in our index.
The index has more than 5000 URLs and far more user IDs. To get the number of unique user IDs, I am running a cardinality aggregation on the buckets returned for the URLs. Right now I am testing on a single node, but Elasticsearch shuts down when I query the data for a whole month. If I use pagination, I won't be able to get the unique counts. In production I would of course increase the number of nodes, but is there any option you would suggest?

Hi divyang,
A cardinality agg nested underneath a terms agg on a field like URL, which has a lot of unique values, uses a lot of RAM.
Your options for reducing RAM usage are:

  1. Trade some accuracy for space using the precision_threshold setting of the cardinality agg
  2. Break your single request into multiple calls (using either the composite agg instead of the terms agg or use the partition feature of the terms aggregation).
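For reference, a rough sketch of option 2 using the terms agg's partitioning (field names are taken from the query shown later in this thread; the partition count of 20 and the bucket size are just illustrations). Each call fixes partition to one value from 0 to num_partitions - 1:

```json
{
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "page_urlpath.keyword",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 1000
      },
      "aggs": {
        "visitors": {
          "cardinality": { "field": "domain_sessionid.keyword" }
        }
      }
    }
  }
}
```

You then repeat the request for partitions 1 through 19 and combine the buckets client-side.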

Thanks for the prompt reply. The partitioning feature is very helpful. Just to clarify: a given URL will be present in only one partition, not any other, right?

Correct. The composite agg offers that guarantee too but can't sort by a child agg (e.g. URLs sorted by number of unique visitors). The terms agg can sort by a child agg, but the order guarantees only hold among the URLs that fall into the same partition.
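To make that concrete, this is roughly what sorting by a child agg looks like on a partitioned terms agg (a sketch; field names taken from the query later in this thread, partition numbers illustrative):

```json
"urls": {
  "terms": {
    "field": "page_urlpath.keyword",
    "include": { "partition": 0, "num_partitions": 20 },
    "order": { "visitors": "desc" }
  },
  "aggs": {
    "visitors": {
      "cardinality": { "field": "domain_sessionid.keyword" }
    }
  }
}
```

The "top URLs by unique visitors" ordering this produces only holds within each partition, not globally.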

I haven't used sorting in the query. Is it necessary? On what basis are the partitions formed?

Also, I am not sure how the composite aggregation would be used with cardinality, so I haven't tried it.

Hash modulo N. The same technique used to route documents by ID to a choice of shard.
You just pick what "N" is at query time; it's a way of evenly dividing up a set of values by hashing them.
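A minimal sketch of the idea in Python (this is not Elasticsearch's actual hash function, just an illustration of the hash-modulo-N routing):

```python
import hashlib

def partition_for(value: str, num_partitions: int) -> int:
    # Hash the term to a stable integer, then take modulo N.
    # The same value always maps to the same partition, so a
    # given URL can never appear in two different partitions.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

urls = ["/", "/about", "/pricing", "/blog/post-1"]
assignments = {url: partition_for(url, 20) for url in urls}
```

Because the mapping is deterministic, running the same query with the same num_partitions always produces the same grouping of URLs.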

Composite agg processes values in value order. Where terms partitioning takes a random subset of all terms in each "page", the composite agg gets the next N terms after the last page's last term. You just swap the composite agg in for your terms agg and make URL the value by which it sorts buckets.
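As a sketch of the paging flow (field names taken from the queries later in this thread; the "after" value is copied from the after_key of the previous response):

```json
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          { "page_urlpath": { "terms": { "field": "page_urlpath.keyword" } } }
        ],
        "after": { "page_urlpath": "/" }
      },
      "aggs": {
        "visitors": {
          "cardinality": { "field": "domain_sessionid.keyword" }
        }
      }
    }
  }
}
```

You repeat the request, feeding each response's after_key back in as "after", until a response comes back without an after_key.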

I tried the composite aggregation as well, but it isn't giving me a direct answer; I will need to process the results in program code. Are you suggesting composite aggregation over partitioning? I'm in favour of partitioning because it gives a direct answer.

Result of a composite query:

{
  "after_key": {
    "page_urlpath": "/",
    "domain_sessionid": "00000d9f-7628-429e-8f97-f82aba38b2d6"
  },
  "buckets": [
    {
      "key": {
        "page_urlpath": "/",
        "domain_sessionid": "000006ba-cead-4577-8374-b5ae43434a47"
      },
      "doc_count": 4
    },
    {
      "key": {
        "page_urlpath": "/",
        "domain_sessionid": "00000d9f-7628-429e-8f97-f82aba38b2d6"
      },
      "doc_count": 3
    }
  ]
}

I need to see your query JSON - it looks like you haven't embedded the cardinality agg underneath the composite agg.

Sorry, I tried it again and it's working. Finally I'm getting the answer through both methods. Which one do you suggest, or are both good?

This is my new query JSON:

{
  "from": 0,
  "size": 0,
  "sort": [{ "page_urlpath": { "order": "asc" } }],
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "page_urlpath": {
              "terms": {
                "field": "page_urlpath.keyword"
              }
            }
          }
        ]
      },
      "aggregations": {
        "visitors": {
          "cardinality": {
            "field": "domain_sessionid.keyword",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}

Not much in it I expect but I'd probably go with composite if order is unimportant.

Is that not on the small side?

No, that was just for seeing if the query works. I had tried composite queries before but they were giving errors. Later I tried with a size of 100 and it worked; then with a size of 1000, Elastic went down :grinning:


As I am running partitioning to query the page URL buckets, I am thinking of using 100-150 partitions. Is that OK? Is there a cap on the number of partitions, or a maximum number that is advisable?

The docs talk about how to pick a suitable size.