Hi, I'm using Python 3 and the elasticsearch library to query some very large ES indexes.
I've been following this blog post as a reference:
I'm getting some results I don't understand: the cardinality aggregation returns a certain number ("precision_threshold": 100), but when I page through the terms aggregation I get a different number of results back:
Cardinality: 41941
Results Returned: 41084
.. so there is a difference of 857, and those results are important. Any suggestions would be appreciated.
The basic code I'm using is below.
For getting the cardinality:
from elasticsearch import Elasticsearch
import math

# create the es object
es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
dataDict = {}

# this gets a rough estimate of how many unique values will be returned
# (no scroll needed here: the request returns only an aggregation, not hits)
page = es.search(
    index=self.index,
    body={
        "size": 0,
        "aggs": {
            "type_count": {
                "cardinality": {
                    "field": self.field,
                    "precision_threshold": 100
                }
            }
        }
    }
)
print("Cardinality Page Results:", page['aggregations']['type_count']['value'])

unique_results = page['aggregations']['type_count']['value']
page_size = 10000
pages_needed = unique_results / page_size
pages_used = math.ceil(pages_needed)
print("Math:", pages_needed, ":", "Rounded", pages_used)
agg_no_pages = pages_used

This is the code I'm using to retrieve the results:
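One thing worth noting about the math above: since the cardinality value is itself an estimate, dividing it by the page size can underestimate how many partitions are really needed. A small sketch of padding the partition count before the division (the 5% margin is my own assumption, not anything from the thread):

```python
import math

def partitions_needed(estimated_cardinality, page_size, safety_margin=0.05):
    """Return a partition count padded to absorb cardinality underestimation."""
    padded = estimated_cardinality * (1 + safety_margin)
    return math.ceil(padded / page_size)

# e.g. an estimate of 39500 unique values with 10000 per page:
# without the margin ceil(3.95) gives 4 partitions; with it we get 5,
# leaving headroom if the true cardinality is a bit above the estimate
print(partitions_needed(39500, 10000))                   # → 5
print(partitions_needed(39500, 10000, safety_margin=0))  # → 4
```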
# probe partition 0 first, to see how many documents fall outside the returned buckets
page = es.search(
    index=self.index,
    body={
        "size": 0,
        "aggs": {
            "unique_ip": {
                "terms": {
                    "field": self.field,
                    "include": {
                        "partition": 0,
                        "num_partitions": agg_no_pages
                    },
                    "size": page_size
                }
            }
        }
    }
)
print(page['aggregations']['unique_ip']['sum_other_doc_count'])
print("--===================--")

for i in range(agg_no_pages):
    print("Round:", i)
    page = es.search(
        index=self.index,
        body={
            "size": 0,
            "aggs": {
                "unique_ip": {
                    "terms": {
                        "field": self.field,
                        "include": {
                            "partition": i,
                            "num_partitions": agg_no_pages
                        },
                        "size": 10000
                    },
                    # per-partition cardinality, for sanity checking
                    "aggs": {
                        "ip_count": {
                            "cardinality": {
                                "field": self.field
                            }
                        }
                    }
                }
            }
        }
    )
    for item in page['aggregations']['unique_ip']['buckets']:
        dataDict[item['key']] = item['doc_count']

return dataDict.copy()
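A second place results can silently go missing: if any partition holds more terms than `size` allows, the extra terms are dropped and show up in that partition's `sum_other_doc_count`. It is worth checking that value on every partition, not just partition 0. A minimal sketch (the response below is hand-built for illustration, not real Elasticsearch output):

```python
def partition_dropped_docs(agg_response, agg_name="unique_ip"):
    """Return how many documents fell outside the returned buckets for one partition.

    A non-zero value means `size` was too small for this partition and some
    terms (with their counts) were silently dropped from the results.
    """
    return agg_response['aggregations'][agg_name]['sum_other_doc_count']

# hand-built example response: 120 documents did not fit in the returned buckets
fake_page = {
    'aggregations': {
        'unique_ip': {
            'sum_other_doc_count': 120,
            'buckets': [{'key': '10.0.0.1', 'doc_count': 7}],
        }
    }
}
print(partition_dropped_docs(fake_page))  # → 120, so this partition lost terms
```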
That is how the cardinality aggregation works: it is approximate and will never count 100% of all occurrences of your term. You can increase precision_threshold up to 40000, but you will still never get an exact count.
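For reference, raising the threshold only means changing the request body; here is a sketch of the body with the maximum value (the field name is a placeholder):

```python
def cardinality_body(field, precision_threshold=40000):
    """Build a cardinality aggregation request body.

    40000 is the maximum precision_threshold Elasticsearch accepts; below the
    threshold counts are close to exact, above it they are HyperLogLog estimates.
    """
    return {
        "size": 0,
        "aggs": {
            "type_count": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        }
    }

body = cardinality_body("ip_address")  # "ip_address" is just an example field
```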
Thank you, that's what I was hoping. I just needed someone else to confirm that this is expected behaviour.
Is this the right approach to retrieving data?
I used to deal with the same use case: high cardinality where I needed to scroll through all occurrences.
The best approach for me was pagination.
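If exact results matter more than speed, the composite aggregation is the usual way to page through every term without losing any: each page returns an `after_key` you feed into the next request. A sketch of building the per-page body (the agg and source names are my own; the loop at the bottom assumes an `es` client and is commented out):

```python
def composite_page_body(field, page_size=10000, after_key=None):
    """Build one page of a composite aggregation request.

    Pass the previous page's after_key to fetch the next page; the first
    request omits "after" entirely.
    """
    body = {
        "size": 0,
        "aggs": {
            "unique_ip": {
                "composite": {
                    "size": page_size,
                    "sources": [{"ip": {"terms": {"field": field}}}],
                }
            }
        }
    }
    if after_key is not None:
        body["aggs"]["unique_ip"]["composite"]["after"] = after_key
    return body

# paging loop sketch (assumes an Elasticsearch client named `es`):
# after = None
# while True:
#     page = es.search(index="my-index", body=composite_page_body("ip", after_key=after))
#     agg = page['aggregations']['unique_ip']
#     for bucket in agg['buckets']:
#         handle(bucket['key']['ip'], bucket['doc_count'])
#     after = agg.get('after_key')
#     if after is None:  # no after_key means the last page was reached
#         break
```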
Thank you, that's what I'm doing.
Thank you for your help.