Hi, I'm indexing a huge amount of log data into Elasticsearch (ES) each month.
I'm trying to write a Python script that builds a report at the end of the month.
The first query I need gets all the unique IP addresses in the index, along with a count of how many times each address hit our network. I have millions of log entries and probably 100,000 unique IP addresses.
The query listed below works, but it only returns about 5000 results. If I increase the size above 5000, it returns 0.
What am I doing wrong? Is there a better approach?
from pprint import pprint

from elasticsearch import Elasticsearch

def search(self):
    # creates the es object
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)
    dataDict = {}
    es_body = {
        # EVERY example says set this to 0 to only get agg results,
        # but I get an error when I set this to 0. WHY?
        "size": 1,
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                    "size": 5000
                }  # terms
            }  # by_ip
        }  # aggs
    }  # end body
    # this gets a rough estimate of how many records will be returned
    page = es.search(
        index=self.index,
        scroll='20m',
        body=es_body
    )
    pprint(page)
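
One alternative I am looking at is a composite aggregation, which pages through all buckets using the after_key from each response instead of asking for one huge terms aggregation. Below is a minimal sketch of that idea, not a tested solution against my cluster. The names count_by_ip, search_fn and page_size are hypothetical; search_fn stands for any callable that takes a request body and returns the response as a dict.

```python
from collections import Counter


def count_by_ip(search_fn, field, start_date, end_date, page_size=1000):
    """Count documents per unique value of `field` with a composite aggregation.

    search_fn is a hypothetical callable: it takes a request body (dict)
    and returns the Elasticsearch response as a dict.
    """
    counts = Counter()
    after_key = None
    while True:
        composite = {
            "size": page_size,  # buckets per page, not total buckets
            "sources": [{"ip": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # resume right after the last bucket of the previous page
            composite["after"] = after_key
        body = {
            "size": 0,  # no hits needed, only aggregation buckets
            "query": {"range": {"@timestamp": {"gte": start_date, "lte": end_date}}},
            "aggs": {"by_ip": {"composite": composite}},
        }
        agg = search_fn(body)["aggregations"]["by_ip"]
        buckets = agg["buckets"]
        if not buckets:
            break
        for bucket in buckets:
            counts[bucket["key"]["ip"]] += bucket["doc_count"]
        after_key = agg.get("after_key")
        if after_key is None:
            break
    return counts
```

With the client from my snippet, this could be called as count_by_ip(lambda body: es.search(index=self.index, body=body), self.field, self.start_date, self.end_Date). Note that this call passes no scroll parameter; as far as I can tell, "size": 0 is rejected only when it is combined with scroll.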