Python3: cardinality aggregation and paged query return different result counts

Hi, I'm using Python 3 and the elasticsearch library to query some very large ES indices.

I've been following this blog post as a reference:

I'm getting some results I don't understand: the cardinality aggregation returns a certain number (with "precision_threshold": 100), but when I page through the terms aggregation I get a different number of results back:

Cardinality: 41941
Results Returned: 41084
.. so there is a difference of 857, and those results are important.

Any suggestions would be appreciated.

Basic code I'm using is below.

For getting the cardinality:

    import math
    from elasticsearch import Elasticsearch

    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}

    # this gets a rough (approximate) count of the unique values in the field
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # no hits needed, only the aggregation
            "aggs": {
                "type_count": {
                    "cardinality": {
                        "field": self.field,
                        "precision_threshold": 100
                    }  # end cardinality
                }  # end type_count
            }  # end aggs
        }  # end body
    )
    print("Cardinality Page Results:", page['aggregations']['type_count']['value'])
    unique_results = page['aggregations']['type_count']['value']
    page_size = 10000

    # how many partitions are needed to page through all unique terms
    pages_needed = unique_results / page_size
    pages_used = math.ceil(pages_needed)
    print("Math:", pages_needed, " : ", "Rounded", pages_used)
    agg_no_pages = pages_used
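
One caveat with the partition math above: the cardinality value is only an estimate, so if it underestimates, a partition can end up holding more than `page_size` terms, and the terms aggregation will silently drop the overflow into `sum_other_doc_count`. A hedged sketch (the numbers and the 10% margin here are made-up assumptions, not from this thread) of padding the estimate before computing `num_partitions`:

```python
import math

# hypothetical numbers for illustration
estimated_unique = 48500   # value reported by the cardinality aggregation
page_size = 10000          # terms-agg "size" per partition

# Pad the estimate before dividing, so an undercount is less likely to
# leave any single partition with more than page_size terms.
safety_margin = 1.10  # assumed 10% pad; tune for your data
agg_no_pages = math.ceil(estimated_unique * safety_margin / page_size)

print(agg_no_pages)  # 6, instead of math.ceil(48500 / 10000) == 5
```

The extra partition costs one more request but removes the dependence on the estimate being an overcount.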

This is the code I'm using to retrieve the results:

    # first request: partition 0 only, to sanity-check sum_other_doc_count
    page = es.search(
        index=self.index,
        body={
            "size": 0,
            "aggs": {
                "unique_ip": {
                    "terms": {
                        "field": self.field,
                        "include": {
                            "partition": 0,
                            "num_partitions": agg_no_pages
                        },  # end include
                        "size": page_size
                    }
                }  # end unique_ip
            }  # end aggs
        })

    print(page['aggregations']['unique_ip']['sum_other_doc_count'])
    print("--===================--")

    for i in range(agg_no_pages):
        print("Round:", i)
        page = es.search(
            index=self.index,
            body={
                "size": 0,
                "aggs": {
                    "unique_ip": {
                        "terms": {
                            "field": self.field,
                            "include": {
                                "partition": i,
                                "num_partitions": agg_no_pages
                            },  # end include
                            "size": page_size
                        },
                        "aggs": {
                            "ip_count": {  # per-bucket cardinality (not read below)
                                "cardinality": {
                                    "field": self.field
                                }
                            }
                        }
                    }
                }
            })
        for item in page['aggregations']['unique_ip']['buckets']:
            dataDict[item['key']] = item['doc_count']

    return dataDict.copy()

This is how the cardinality agg works: it is approximate, so it will never give you an exact count of all occurrences of your term. You can increase precision_threshold up to 40000, but you will still never get exactly 100% :slight_smile:
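
For context on the size of the gap in the question: relative to the exact count obtained by paging the terms aggregation, the estimate is off by roughly 2%, which is plausible for an approximate cardinality agg at a low precision_threshold. A quick check of the arithmetic:

```python
# numbers from the question
estimated = 41941   # cardinality agg, precision_threshold=100
exact = 41084       # unique terms actually returned by partitioned paging

error = estimated - exact
relative_error = error / exact

print(error)                           # 857
print(round(relative_error * 100, 2))  # 2.09 (% relative error)
```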

Thank you, that's what I was hoping.. I just needed someone else to confirm that this is alright.

Is this the right approach to retrieving data?

I used to deal with the same use case, with high cardinality where I needed to scroll through all occurrences.
The best approach for me was pagination.
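
If an exact walk over every unique value is the goal, the composite aggregation is another pagination option: it pages with an `after_key` cursor and never truncates, so no cardinality estimate is needed up front. A minimal sketch, assuming an `es` client configured as in the question (the helper name and the `"value"` source key are my own, not from the thread):

```python
def composite_body(field, size, after_key=None):
    """Build a composite-aggregation request body; after_key is the
    cursor returned by the previous page (None for the first page)."""
    comp = {
        "size": size,
        "sources": [{"value": {"terms": {"field": field}}}]
    }
    if after_key is not None:
        comp["after"] = after_key
    return {"size": 0, "aggs": {"unique_values": {"composite": comp}}}

# Paging loop (needs a live cluster, so only sketched here):
# results = {}
# after = None
# while True:
#     page = es.search(index=self.index,
#                      body=composite_body(self.field, 10000, after))
#     agg = page['aggregations']['unique_values']
#     for bucket in agg['buckets']:
#         results[bucket['key']['value']] = bucket['doc_count']
#     after = agg.get('after_key')
#     if after is None:   # no cursor returned: last page reached
#         break
```

The loop ends when the response no longer carries an `after_key`, so every unique value is visited exactly once.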

Thank you, that's what I'm doing.

Thank you for your help.