Python3 Large Query Cardinality & Query return different results

Hi, I'm using Python 3 and the elasticsearch library to query some very large ES indices.

I've been following this blog post as a reference:

I'm getting some results I don't understand: the cardinality aggregation returns a certain number (with `"precision_threshold": 100`), but when I page through the terms aggregation I get a different number of results back:

Cardinality: 41941
Results Returned: 41084
...so there is a difference of 857, and those missing results are important.

Any suggestions would be appreciated.

The basic code I'm using is below.

For getting the cardinality:

    import math
    from elasticsearch import Elasticsearch

    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}

    # this gets a rough estimate of how many unique values exist
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # aggregation only, no hits (and no scroll) needed
            "aggs": {
                "type_count": {
                    "cardinality": {
                        "field": self.field,
                        "precision_threshold": 100
                    }  # end cardinality
                }  # end type_count
            }  # end aggs
        }  # end body
    )
    unique_results = page['aggregations']['type_count']['value']
    print("Cardinality Page Results:", unique_results)

    page_size = 10000
    pages_needed = unique_results / page_size
    pages_used = math.ceil(pages_needed)
    print("Math:", pages_needed, ":", "Rounded:", pages_used)
    agg_no_pages = pages_used
This is the code I'm using to retrieve the results:

    # initial query to inspect sum_other_doc_count for partition 0
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # aggregation only, no hits needed
            "aggs": {
                "unique_ip": {
                    "terms": {
                        "field": self.field,
                        "include": {
                            "partition": 0,
                            "num_partitions": agg_no_pages
                        },  # end include
                        "size": page_size
                    }
                }  # end unique_ip
            }  # end aggs
        })

    print(page['aggregations']['unique_ip']['sum_other_doc_count'])
    print("--===================--")
    for i in range(agg_no_pages):
        print("Round:", i)
        page = es.search(
            index=self.index,
            body={
                "size": 0,  # aggregation only, no hits needed
                "aggs": {
                    "unique_ip": {
                        "terms": {
                            "field": self.field,
                            "include": {
                                "partition": i,
                                "num_partitions": agg_no_pages
                            },  # end include
                            "size": page_size
                        },
                        "aggs": {
                            "ip_count": {
                                "cardinality": {
                                    "field": self.field
                                }
                            }
                        }
                    }
                }
            })
        for item in page['aggregations']['unique_ip']['buckets']:
            dataDict[item['key']] = item['doc_count']

    return dataDict.copy()
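A cheap sanity check you can run on each partition response: if `sum_other_doc_count` is non-zero, or a partition comes back completely full, some terms fell outside the requested `size` and the totals will not add up. A sketch (the helper name is mine; the response fields are standard terms-aggregation output):

```python
def check_partition(page, page_size, agg_name='unique_ip'):
    """Return the buckets of one terms-agg partition, raising if the
    partition overflowed the requested size (which would drop terms)."""
    agg = page['aggregations'][agg_name]
    if agg['sum_other_doc_count'] > 0 or len(agg['buckets']) >= page_size:
        raise RuntimeError("terms fell outside the buckets; "
                           "raise num_partitions or size")
    return agg['buckets']
```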

That is just how the cardinality aggregation works: it is approximate, so it will never count 100% of all occurrences of your term. You can increase `precision_threshold` up to 40000, but you will still never get an exact count :slight_smile:

Thank you, that's what I was hoping. I just needed someone else to confirm that this is expected behaviour.

Is this the right approach to retrieving data?

I used to deal with the same use case: high cardinality, where I needed to scroll through all occurrences.
The best approach for me was pagination.
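If you need an exact walk over every distinct value rather than an approximation, the composite aggregation pages deterministically using its `after_key` cursor. A sketch, assuming Elasticsearch ≥ 6.1 and a keyword-mapped field (the function and aggregation names here are my own):

```python
def iter_unique_terms(es, index, field, page_size=10000):
    """Yield (value, doc_count) for every distinct value of `field`,
    paging with a composite aggregation and its after_key cursor."""
    after = None
    while True:
        comp = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}]
        }
        if after is not None:
            comp["after"] = after  # resume after the last key of the previous page
        page = es.search(index=index, body={
            "size": 0,  # aggregation only, no hits needed
            "aggs": {"uniques": {"composite": comp}}
        })
        agg = page['aggregations']['uniques']
        if not agg['buckets']:
            break
        for bucket in agg['buckets']:
            yield bucket['key']['value'], bucket['doc_count']
        after = agg.get('after_key')
        if after is None:
            break
```

Unlike partitioned terms, this cannot drop values when the cardinality estimate is off, because the cursor comes from the response itself rather than from an up-front page count.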

Thank you, that's what I'm doing.

Thank you for your help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.