Hi, I'm using Python 3 and the elasticsearch library to query some very large ES indexes.
I've been following this blog post as a reference:
I'm getting some results I don't understand: the cardinality aggregation returns a certain number ("precision_threshold": 100), but when I page through the terms aggregation I get a different number of results back:
Cardinality: 41941
Results Returned: 41084
.. so there is a difference of 857, and those results are important. Any suggestions would be appreciated.
The basic code I'm using is below.
For getting the cardinality:
from elasticsearch import Elasticsearch
import math

# create the es object
es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
dataDict = {}

# this gets a rough estimate of how many unique values will be returned
# (no scroll needed here: the request returns only an aggregation, not hits)
page = es.search(
    index=self.index,
    body={
        "size": 0,
        "aggs": {
            "type_count": {
                "cardinality": {
                    "field": self.field,
                    "precision_threshold": 100
                }
            }
        }
    }
)
print("Cardinality Page Results:", page['aggregations']['type_count']['value'])

unique_results = page['aggregations']['type_count']['value']
page_size = 10000
pages_needed = unique_results / page_size
pages_used = math.ceil(pages_needed)
print("Math:", pages_needed, ":", "Rounded", pages_used)
agg_no_pages = pages_used

This is the code I'm using to retrieve the results:
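One thing worth noting about the math above: since the cardinality value is itself an estimate, dividing it by the page size can underestimate how many partitions are really needed. A small sketch of padding the partition count before the division (the 5% margin is my own assumption, not anything from the thread):

```python
import math

def partitions_needed(estimated_cardinality, page_size, safety_margin=0.05):
    """Return a partition count padded to absorb cardinality underestimation."""
    padded = estimated_cardinality * (1 + safety_margin)
    return math.ceil(padded / page_size)

# e.g. an estimate of 39500 unique values with 10000 per page:
# without the margin ceil(3.95) gives 4 partitions; with it we get 5,
# leaving headroom if the true cardinality is a bit above the estimate
print(partitions_needed(39500, 10000))                   # → 5
print(partitions_needed(39500, 10000, safety_margin=0))  # → 4
```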
# probe partition 0 first, to see how many documents fall outside the returned buckets
page = es.search(
    index=self.index,
    body={
        "size": 0,
        "aggs": {
            "unique_ip": {
                "terms": {
                    "field": self.field,
                    "include": {
                        "partition": 0,
                        "num_partitions": agg_no_pages
                    },
                    "size": page_size
                }
            }
        }
    }
)
print(page['aggregations']['unique_ip']['sum_other_doc_count'])
print("--===================--")

for i in range(agg_no_pages):
    print("Round:", i)
    page = es.search(
        index=self.index,
        body={
            "size": 0,
            "aggs": {
                "unique_ip": {
                    "terms": {
                        "field": self.field,
                        "include": {
                            "partition": i,
                            "num_partitions": agg_no_pages
                        },
                        "size": 10000
                    },
                    # per-partition cardinality, for sanity checking
                    "aggs": {
                        "ip_count": {
                            "cardinality": {
                                "field": self.field
                            }
                        }
                    }
                }
            }
        }
    )
    for item in page['aggregations']['unique_ip']['buckets']:
        dataDict[item['key']] = item['doc_count']

return dataDict.copy()
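A second place results can silently go missing: if any partition holds more terms than `size` allows, the extra terms are dropped and show up in that partition's `sum_other_doc_count`. It is worth checking that value on every partition, not just partition 0. A minimal sketch (the response below is hand-built for illustration, not real Elasticsearch output):

```python
def partition_dropped_docs(agg_response, agg_name="unique_ip"):
    """Return how many documents fell outside the returned buckets for one partition.

    A non-zero value means `size` was too small for this partition and some
    terms (with their counts) were silently dropped from the results.
    """
    return agg_response['aggregations'][agg_name]['sum_other_doc_count']

# hand-built example response: 120 documents did not fit in the returned buckets
fake_page = {
    'aggregations': {
        'unique_ip': {
            'sum_other_doc_count': 120,
            'buckets': [{'key': '10.0.0.1', 'doc_count': 7}],
        }
    }
}
print(partition_dropped_docs(fake_page))  # → 120, so this partition lost terms
```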
That is how the cardinality aggregation works: it is approximate and will never count 100% of all occurrences of your term. You can increase precision_threshold up to 40000, but you will still never get an exact count.
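For reference, raising the threshold only means changing the request body; here is a sketch of the body with the maximum value (the field name is a placeholder):

```python
def cardinality_body(field, precision_threshold=40000):
    """Build a cardinality aggregation request body.

    40000 is the maximum precision_threshold Elasticsearch accepts; below the
    threshold counts are close to exact, above it they are HyperLogLog estimates.
    """
    return {
        "size": 0,
        "aggs": {
            "type_count": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        }
    }

body = cardinality_body("ip_address")  # "ip_address" is just an example field
```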
Thank you, that's what I was hoping. I just needed someone else to confirm that this is expected behaviour.
Is this the right approach to retrieving data?
I used to deal with the same use case: high cardinality where I needed to scroll through all occurrences.
The best approach for me was pagination.
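If exact results matter more than speed, the composite aggregation is the usual way to page through every term without losing any: each page returns an `after_key` you feed into the next request. A sketch of building the per-page body (the agg and source names are my own; the loop at the bottom assumes an `es` client and is commented out):

```python
def composite_page_body(field, page_size=10000, after_key=None):
    """Build one page of a composite aggregation request.

    Pass the previous page's after_key to fetch the next page; the first
    request omits "after" entirely.
    """
    body = {
        "size": 0,
        "aggs": {
            "unique_ip": {
                "composite": {
                    "size": page_size,
                    "sources": [{"ip": {"terms": {"field": field}}}],
                }
            }
        }
    }
    if after_key is not None:
        body["aggs"]["unique_ip"]["composite"]["after"] = after_key
    return body

# paging loop sketch (assumes an Elasticsearch client named `es`):
# after = None
# while True:
#     page = es.search(index="my-index", body=composite_page_body("ip", after_key=after))
#     agg = page['aggregations']['unique_ip']
#     for bucket in agg['buckets']:
#         handle(bucket['key']['ip'], bucket['doc_count'])
#     after = agg.get('after_key')
#     if after is None:  # no after_key means the last page was reached
#         break
```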
Thank you, that's what I'm doing.
Thank you for your help.