Python 3 Elasticsearch query returns a large number of results. Am I doing this right?

Hi, I'm still not sure I'm doing this right, so any suggestions would be appreciated.
I have a LARGE index of log data containing IP addresses, ports, and other network traffic info. It's millions of entries a month, and I have a year's worth of data. I'm trying to write a Python script to build some end-of-year data for a report.

As of now I'm just trying to pull the unique IP addresses and the count of how many times each IP address occurs over a given time period. Seems pretty simple. I'd then like to pull the previous month's data and do some comparisons of the two. Below is my search/query function. I submit a date range and a column/field in the index. It should return the unique values of that field and the count for each.

It returns something. I use page['aggregations']['type_count']['value'] to calculate how many pages I'll need if the return size is 10000 (the maximum return size), then round that number up. That number is used in a for loop to pull all the values, or that's what I WANT it to do. But it seems that no matter what that number is, it always pulls some kind of values.

Example: if I run the query and page['aggregations']['type_count']['value'] = 41231, I divide that by 10,000 and get 4.1231. I round up to 5, loop through 5 pages, and get 5 pages of results. What I don't understand is that if I set the loop to 10, I still get 10 pages of results. WHY? Does that make sense? Is there a better algorithm or process for pulling a large query result from ES?
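To make the arithmetic and the looping behaviour concrete, here is a small standalone sketch. It assumes, as I understand the terms aggregation's partition filter, that each term is assigned to a partition by hashing it modulo num_partitions. The crc32 hash below is only a stand-in for whatever Elasticsearch actually uses, and the generated addresses are made up. Under that assumption, any number of partitions covers every term exactly once, which would explain why looping over 10 partitions still returns results: the same terms are just spread across more, smaller pages.

```python
import math
import zlib


def pages_needed(unique_count, page_size=10000):
    """Round up the number of partitions needed to cover unique_count terms."""
    return math.ceil(unique_count / page_size)


def partition_of(term, num_partitions):
    """Toy stand-in for hashing a term into a partition (NOT Elasticsearch's real hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions


def simulate(terms, num_partitions):
    """Return how many terms land in each partition."""
    counts = [0] * num_partitions
    for term in terms:
        counts[partition_of(term, num_partitions)] += 1
    return counts


# 41231 made-up unique addresses, matching the example count in the post
terms = [f"10.0.{i // 256}.{i % 256}" for i in range(41231)]

print(pages_needed(41231))      # 5
five = simulate(terms, 5)
ten = simulate(terms, 10)
print(sum(five), sum(ten))      # 41231 41231
```

Note that this only holds when the loop count and num_partitions are the same number, as they are in the function below.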

Here is my basic code:
import math

from elasticsearch import Elasticsearch


def search(self):
    # creates the es object
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}

    # both requests are limited to the requested date range
    # self.start_Date: example 2020-10-01, self.end_Date: example 2020-10-31
    date_query = {
        "bool": {
            "must": {
                "range": {"@timestamp": {"gte": self.start_Date, "lte": self.end_Date}}
            }  # must
        }  # bool
    }

    # this gets a rough estimate of how many unique values will be returned
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # only the aggregation is needed, so no scroll and no hits
            "query": date_query,
            "aggs": {
                "type_count": {
                    "cardinality": {
                        "field": self.field,
                        "precision_threshold": 40000
                    }  # end cardinality
                }  # end type count
            },  # end aggs
        }  # body
    )
    print("Cardinality Page Results:", page['aggregations']['type_count']['value'])
    unique_results = page['aggregations']['type_count']['value']
    page_size = 10000

    # calculate how many pages i'm going to need
    pages_needed = unique_results / page_size
    pages_used = math.ceil(pages_needed)  # rounds up for amount of pages needed
    print("Math:", pages_needed, " : ", "Rounded", pages_used)
    agg_no_pages = pages_used

    for i in range(agg_no_pages):
        print("Round:", i)
        page = es.search(
            index=self.index,
            body={
                "size": 0,
                "query": date_query,  # same date range as the estimate above
                "aggs": {
                    "unique_ip": {
                        "terms": {
                            "field": self.field,
                            "include": {
                                "partition": i,
                                "num_partitions": agg_no_pages
                            },  # end include
                            "size": page_size
                        },
                        "aggs": {
                            "ip_count": {
                                "cardinality": {
                                    "field": self.field
                                }
                            }
                        }
                    }
                }
            })
        print(i, ":", page)

        # doc_count is how many times each unique value occurs
        for item in page['aggregations']['unique_ip']['buckets']:
            print("::", item['key'], ":", item['doc_count'])
            dataDict[item['key']] = item['doc_count']

    self.results = dataDict.copy()
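For the "better process" part of the question, one option that might fit is the composite aggregation, which pages through every bucket using the after_key returned with each response, so no page count has to be estimated up front. The following is a sketch under assumptions, not a tested drop-in: iter_term_counts, the search callable, and the aggregation names unique_terms and term are all hypothetical names chosen for illustration.

```python
def iter_term_counts(search, field, start_date, end_date, page_size=1000):
    """Yield (term, doc_count) pairs by paging a composite aggregation.

    `search` is a hypothetical callable that takes a request body dict and
    returns the response dict, e.g. lambda body: es.search(index=..., body=body).
    """
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"term": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # resume after the last bucket of the previous page
            composite["after"] = after_key

        body = {
            "size": 0,  # aggregation results only, no hits
            "query": {"range": {"@timestamp": {"gte": start_date, "lte": end_date}}},
            "aggs": {"unique_terms": {"composite": composite}},
        }
        agg = search(body)["aggregations"]["unique_terms"]

        for bucket in agg["buckets"]:
            yield bucket["key"]["term"], bucket["doc_count"]

        # stop when a page comes back empty or without an after_key
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            break
```

Wrapping the generator in dict(...) would give the same term-to-count mapping that the loop above builds in dataDict, using the same field and date range.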
