Hi, I'm still not sure I'm doing this right, so any suggestions would be appreciated.
I have a LARGE Elasticsearch index of log data containing IP addresses, ports, and other network traffic info. It's millions of entries a month, and I have a year's worth of data. I'm trying to write a Python script to build some end-of-year numbers for a report.
Right now I'm just trying to pull the unique IP addresses and a count of how many times each IP address occurs over a given time period, which seems pretty simple. I'd then like to pull the previous month's data and do some comparisons of the two. Below is my search/query function: I submit a date range and a column/field in the index, and it should return the unique values of that field along with their counts.
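(The comparison step itself I'm planning to do in plain Python once I have two of those result dicts. A minimal sketch of what I have in mind, assuming each run of search() leaves a {ip: count} dict in self.results; compare_months and the argument names are just placeholders:)

def compare_months(prev_counts, curr_counts):
    # IPs that only appear in the current month / only in the previous month
    new_ips = curr_counts.keys() - prev_counts.keys()
    gone_ips = prev_counts.keys() - curr_counts.keys()
    # per-IP change in hit count for IPs seen in both months
    deltas = {ip: curr_counts[ip] - prev_counts[ip]
              for ip in curr_counts.keys() & prev_counts.keys()}
    return new_ips, gone_ips, deltas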
As for the query function: it does return something. I use page['aggregations']['type_count']['value'] to calculate how many pages I'll need if the return size is 10,000 (the maximum return size), and round that number up. That number is used in a for loop to pull all the values, or that's what I WANT it to do. But it seems that no matter what that number is, it always pulls values. Example: if I run the query and page['aggregations']['type_count']['value'] = 41231, I divide that by 10,000 to get 4.1231, round up to 5, loop through 5 pages, and get 5 pages of results. What I don't understand is: if I set the loop to 10, I will still get 10 pages of results. WHY? Does that make sense? Is there a better algorithm or process for pulling a large query result from ES?
Here is my basic code:
import math

from elasticsearch import Elasticsearch

def search(self):
    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}

    # first query: a cardinality aggregation to get a rough estimate of how
    # many unique values of self.field fall inside the date range
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # aggregation only, no hits (and no scroll needed)
            "query": {
                "bool": {
                    "must": {
                        # self.start_date: example 2020-10-01, self.end_date: example 2020-10-31
                        "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}
                    }
                }
            },
            "aggs": {
                "type_count": {
                    "cardinality": {
                        "field": self.field,
                        "precision_threshold": 40000
                    }
                }
            }
        }
    )
    unique_results = page['aggregations']['type_count']['value']
    print("Cardinality Page Results:", unique_results)

    # calculate how many pages/partitions I'm going to need at 10,000 terms per page
    page_size = 10000
    pages_needed = unique_results / page_size
    pages_used = math.ceil(pages_needed)  # rounds up for the number of pages needed
    print("Math:", pages_needed, ":", "Rounded:", pages_used)

    agg_no_pages = pages_used
    for i in range(agg_no_pages):
        print("Round:", i)
        page = es.search(
            index=self.index,
            body={
                "size": 0,
                # same date filter as above, otherwise this aggregates over the whole index
                "query": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}
                },
                "aggs": {
                    "unique_ip": {
                        "terms": {
                            "field": self.field,
                            # partition i of agg_no_pages: each request should
                            # return a different slice of the unique terms
                            "include": {
                                "partition": i,
                                "num_partitions": agg_no_pages
                            },
                            "size": page_size
                        }
                    }
                }
            }
        )
        print(i, ":", page)
        for item in page['aggregations']['unique_ip']['buckets']:
            print("::", item['key'], ":", item['doc_count'])
            dataDict[item['key']] = item['doc_count']

    self.results = dataDict.copy()
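On the "better process" question: from the docs it looks like a composite aggregation with after-key paging might be the more standard way to walk every unique term without guessing a page count up front. A minimal sketch of what I think that would look like, untested, reusing the same es/self attributes as above:

after_key = None
dataDict = {}
while True:
    composite = {
        "sources": [{"ip": {"terms": {"field": self.field}}}],
        "size": 10000
    }
    if after_key is not None:
        composite["after"] = after_key  # resume where the previous page ended
    page = es.search(
        index=self.index,
        body={
            "size": 0,
            "query": {"range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}},
            "aggs": {"unique_ip": {"composite": composite}}
        }
    )
    agg = page["aggregations"]["unique_ip"]
    for bucket in agg["buckets"]:
        dataDict[bucket["key"]["ip"]] = bucket["doc_count"]
    after_key = agg.get("after_key")
    if after_key is None:  # no after_key means no more pages
        break

Would that be the right direction here, or is the partitioned terms approach above supposed to work the way I expected?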