Hi, I'm still not sure I'm doing this right, so any suggestions would be appreciated.
I have a LARGE Elasticsearch index of log data containing IP addresses, ports, and other network traffic info. It's millions of entries a month, and I have a year's worth of data. I'm trying to write a Python script to build some end-of-year numbers for a report.
Right now I'm just trying to pull the unique IP addresses and a count of how many times each IP address occurs over a given time period, which seems pretty simple. I'd then like to pull the previous month's data and do some comparisons of the two. Below is my search/query function: I submit a date range and a column/field in the index, and it should return the unique values of that field along with their counts.
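(The comparison step itself I'm planning to do in plain Python once I have two of those result dicts. A minimal sketch of what I have in mind, assuming each run of search() leaves a {ip: count} dict in self.results; compare_months and the argument names are just placeholders:)

def compare_months(prev_counts, curr_counts):
    # IPs that only appear in the current month / only in the previous month
    new_ips = curr_counts.keys() - prev_counts.keys()
    gone_ips = prev_counts.keys() - curr_counts.keys()
    # per-IP change in hit count for IPs seen in both months
    deltas = {ip: curr_counts[ip] - prev_counts[ip]
              for ip in curr_counts.keys() & prev_counts.keys()}
    return new_ips, gone_ips, deltas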
As for the query function: it does return something. I use page['aggregations']['type_count']['value'] to calculate how many pages I'll need if the return size is 10,000 (the maximum return size), and round that number up. That number is used in a for loop to pull all the values, or that's what I WANT it to do. But it seems that no matter what that number is, it always pulls values. Example: if I run the query and page['aggregations']['type_count']['value'] = 41231, I divide that by 10,000 to get 4.1231, round up to 5, loop through 5 pages, and get 5 pages of results. What I don't understand is: if I set the loop to 10, I will still get 10 pages of results. WHY? Does that make sense? Is there a better algorithm or process for pulling a large query result from ES?
Here is my basic code:
import math

from elasticsearch import Elasticsearch

def search(self):
    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}

    # first query: a cardinality aggregation to get a rough estimate of how
    # many unique values of self.field fall inside the date range
    page = es.search(
        index=self.index,
        body={
            "size": 0,  # aggregation only, no hits (and no scroll needed)
            "query": {
                "bool": {
                    "must": {
                        # self.start_date: example 2020-10-01, self.end_date: example 2020-10-31
                        "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}
                    }
                }
            },
            "aggs": {
                "type_count": {
                    "cardinality": {
                        "field": self.field,
                        "precision_threshold": 40000
                    }
                }
            }
        }
    )
    unique_results = page['aggregations']['type_count']['value']
    print("Cardinality Page Results:", unique_results)

    # calculate how many pages/partitions I'm going to need at 10,000 terms per page
    page_size = 10000
    pages_needed = unique_results / page_size
    pages_used = math.ceil(pages_needed)  # rounds up for the number of pages needed
    print("Math:", pages_needed, ":", "Rounded:", pages_used)

    agg_no_pages = pages_used
    for i in range(agg_no_pages):
        print("Round:", i)
        page = es.search(
            index=self.index,
            body={
                "size": 0,
                # same date filter as above, otherwise this aggregates over the whole index
                "query": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}
                },
                "aggs": {
                    "unique_ip": {
                        "terms": {
                            "field": self.field,
                            # partition i of agg_no_pages: each request should
                            # return a different slice of the unique terms
                            "include": {
                                "partition": i,
                                "num_partitions": agg_no_pages
                            },
                            "size": page_size
                        }
                    }
                }
            }
        )
        print(i, ":", page)
        for item in page['aggregations']['unique_ip']['buckets']:
            print("::", item['key'], ":", item['doc_count'])
            dataDict[item['key']] = item['doc_count']

    self.results = dataDict.copy()
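On the "better process" question: from the docs it looks like a composite aggregation with after-key paging might be the more standard way to walk every unique term without guessing a page count up front. A minimal sketch of what I think that would look like, untested, reusing the same es/self attributes as above:

after_key = None
dataDict = {}
while True:
    composite = {
        "sources": [{"ip": {"terms": {"field": self.field}}}],
        "size": 10000
    }
    if after_key is not None:
        composite["after"] = after_key  # resume where the previous page ended
    page = es.search(
        index=self.index,
        body={
            "size": 0,
            "query": {"range": {"@timestamp": {"gte": self.start_date, "lte": self.end_date}}},
            "aggs": {"unique_ip": {"composite": composite}}
        }
    )
    agg = page["aggregations"]["unique_ip"]
    for bucket in agg["buckets"]:
        dataDict[bucket["key"]["ip"]] = bucket["doc_count"]
    after_key = agg.get("after_key")
    if after_key is None:  # no after_key means no more pages
        break

Would that be the right direction here, or is the partitioned terms approach above supposed to work the way I expected?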