Running cardinality for more than 10000 buckets

Hi, I am working on a project to get the number of unique user IDs per URL in our index.
The index has more than 5000 URLs and far more user IDs. To get the number of unique user IDs, I am running a cardinality aggregation on the buckets returned for the URLs. Right now I am testing on a single node, but Elasticsearch shuts down when I query the data for a whole month. If I use pagination, I won't be able to get the unique counts. In production I would of course increase the number of nodes, but is there any option you would suggest?

Hi divyang,
A cardinality agg nested underneath a terms agg on a field like URL, which has a lot of unique values, uses a lot of RAM.
Your options for reducing RAM usage are:

  1. Trade some accuracy for space using the precision_threshold setting of the cardinality agg
  2. Break your single request into multiple calls (using either the composite agg instead of the terms agg or use the partition feature of the terms aggregation).
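For reference, a rough sketch of option 2 using the terms agg's partitioning (field names are taken from the query shown later in this thread; the partition count of 20 and the bucket size are just illustrations). Each call fixes partition to one value from 0 to num_partitions - 1:

```json
{
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "page_urlpath.keyword",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 1000
      },
      "aggs": {
        "visitors": {
          "cardinality": { "field": "domain_sessionid.keyword" }
        }
      }
    }
  }
}
```

You then repeat the request for partitions 1 through 19 and combine the buckets client-side.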

Thanks for the prompt reply. The partitioning feature is very helpful. Just to clarify: a given URL will be present in only one partition, not any other, right?

Correct. The composite agg offers that guarantee too but can't sort by a child agg (e.g. URLs sorted by number of unique visitors). The terms agg can sort by a child agg, but the order guarantees only hold among the URLs that fall into the same partition.
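To make that concrete, this is roughly what sorting by a child agg looks like on a partitioned terms agg (a sketch; field names taken from the query later in this thread, partition numbers illustrative):

```json
"urls": {
  "terms": {
    "field": "page_urlpath.keyword",
    "include": { "partition": 0, "num_partitions": 20 },
    "order": { "visitors": "desc" }
  },
  "aggs": {
    "visitors": {
      "cardinality": { "field": "domain_sessionid.keyword" }
    }
  }
}
```

The "top URLs by unique visitors" ordering this produces only holds within each partition, not globally.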

I haven't used sorting in the query. Is it necessary? On what basis are the partitions formed?

Also, I am not sure how the composite aggregation would be used with cardinality, so I haven't tried it.

Hash modulo N. The same technique used to route documents by ID to a choice of shard.
You just pick what "N" is at query time; it's a way of evenly dividing up a set of values by hashing them.
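A minimal sketch of the idea in Python (this is not Elasticsearch's actual hash function, just an illustration of the hash-modulo-N routing):

```python
import hashlib

def partition_for(value: str, num_partitions: int) -> int:
    # Hash the term to a stable integer, then take modulo N.
    # The same value always maps to the same partition, so a
    # given URL can never appear in two different partitions.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

urls = ["/", "/about", "/pricing", "/blog/post-1"]
assignments = {url: partition_for(url, 20) for url in urls}
```

Because the mapping is deterministic, running the same query with the same num_partitions always produces the same grouping of URLs.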

Composite agg processes values in value order. Where terms partitioning takes a random subset of all terms in each "page", the composite agg gets the next N terms after the last page's last term. You just swap the composite agg in for your terms agg and make URL the value by which it sorts buckets.
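As a sketch of the paging flow (field names taken from the queries later in this thread; the "after" value is copied from the after_key of the previous response):

```json
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          { "page_urlpath": { "terms": { "field": "page_urlpath.keyword" } } }
        ],
        "after": { "page_urlpath": "/" }
      },
      "aggs": {
        "visitors": {
          "cardinality": { "field": "domain_sessionid.keyword" }
        }
      }
    }
  }
}
```

You repeat the request, feeding each response's after_key back in as "after", until a response comes back without an after_key.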

I tried the composite aggregation as well, but it isn't giving me a direct answer; I will need to process the results in program code. Are you suggesting composite aggregation over partitioning? I'm in favour of partitioning because it gives a direct answer.

Result of a composite query:

{
  "after_key": {
    "page_urlpath": "/",
    "domain_sessionid": "00000d9f-7628-429e-8f97-f82aba38b2d6"
  },
  "buckets": [
    {
      "key": {
        "page_urlpath": "/",
        "domain_sessionid": "000006ba-cead-4577-8374-b5ae43434a47"
      },
      "doc_count": 4
    },
    {
      "key": {
        "page_urlpath": "/",
        "domain_sessionid": "00000d9f-7628-429e-8f97-f82aba38b2d6"
      },
      "doc_count": 3
    }
  ]
}

I need to see your query JSON - it looks like you haven't embedded the cardinality agg underneath the composite agg.

Sorry, I tried it again and it's working. Finally I'm getting the answer through both methods. Which one do you suggest, or are both good?

This is my new query JSON:

{
  "from": 0,
  "size": 0,
  "sort": [{ "page_urlpath": { "order": "asc" } }],
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "page_urlpath": {
              "terms": {
                "field": "page_urlpath.keyword"
              }
            }
          }
        ]
      },
      "aggregations": {
        "visitors": {
          "cardinality": {
            "field": "domain_sessionid.keyword",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}

Not much in it I expect but I'd probably go with composite if order is unimportant.

Is that not on the small side?

No, that was just for seeing if the query works. I had tried composite queries before but they were giving errors. Later I tried with a size of 100 and it worked; then with a size of 1000, Elastic went down :grinning:


As I am running partitioning to query the page URL buckets, I am thinking of using 100-150 partitions. Is that OK? Is there a cap on the number of partitions, or a maximum number that is advisable?

The docs talk about how to pick a suitable size.