Hi, I'm throwing a huge amount of log data into Elasticsearch each month.
I'm trying to create a Python script to build a report at the end of the month.
The first query I need to make is against the index to get all the unique IP addresses and the count of how many times each IP hit our network. I have millions of log entries, and probably 100,000 unique IP addresses.
The query listed below works, but only returns about 5,000 buckets; if I increase the size above 5,000, it returns 0.
What am I doing wrong? Is there a better approach?
from elasticsearch import Elasticsearch
from pprint import pprint

def search(self):
    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60,
                       max_retries=3, retry_on_timeout=True)
    dataDict = {}
    es_body = {
        # EVERY example says set this to 0 to only get agg results,
        # but I get an error when I set it to 0.. WHY?
        "size": 1,
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                    "size": 5000
                }  # terms
            }  # by_ip
        }  # aggs
    }  # end body
    # this gets a rough estimate of how many records will be returned
    page = es.search(
        index=self.index,
        scroll='20m',
        body=es_body
    )
    pprint(page)
The error when size is 0:
elasticsearch.exceptions.RequestError: RequestError(400, 'action_request_validation_exception', 'Validation Failed: 1: [size] cannot be [0] in a scroll context;')
But it seems every example online has size set to 0.
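For what it's worth, the validation message itself points at the cause: the request combines `scroll='20m'` with `size: 0`, and a scroll context must return hits. Since the goal here is aggregation buckets, not hits, the scroll parameter can simply be dropped. A minimal sketch of the same request body built as a standalone function, with `size: 0` now valid (the field name `src_ip` and the date-math bounds are made-up placeholders):

```python
def build_agg_body(start_date, end_date, field, num_terms=5000):
    """Request body for a terms aggregation; size 0 suppresses hits.

    Valid only when NOT combined with the scroll parameter.
    """
    return {
        "size": 0,  # no hits needed -- we only want the aggregation buckets
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": start_date, "lte": end_date}}
                }
            }
        },
        "aggs": {
            "by_ip": {
                "terms": {"field": field, "size": num_terms}
            }
        },
    }

body = build_agg_body("now-30d/d", "now/d", "src_ip")
# then: es.search(index=self.index, body=body)  <- no scroll argument
```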
Along the same lines, dealing with the same problem:
I'm trying to get the unique IP addresses and the count of how many times they occur in some logs. It's a LOT of data. It's my understanding that the "cardinality" agg will give me an estimate of the number of entries so I can calculate the paging I will need.
When I do this I get this result: Estimated Entries: 77352
When I run the same query with "terms" instead of "cardinality" I get a DRASTICALLY different number:
'doc_count_error_upper_bound': 361980,
'sum_other_doc_count': 147144334}}
What am I missing here? Shouldn't these be close to the same?
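As it turns out, the two numbers measure different things: cardinality estimates how many distinct values exist, while sum_other_doc_count is the number of documents that fell outside the buckets the terms agg actually returned. A toy illustration of the distinction in plain Python, no cluster needed (the IPs and counts here are made up):

```python
from collections import Counter

# hypothetical log entries: one IP per hit
logs = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 30 + ["10.0.0.3"] * 5
counts = Counter(logs)

# "cardinality" analogue: number of DISTINCT values
cardinality = len(counts)            # 3 unique IPs

# "terms" agg with size=2: only the top-2 buckets by doc count come back
top_buckets = counts.most_common(2)  # [("10.0.0.1", 50), ("10.0.0.2", 30)]

# "sum_other_doc_count" analogue: DOCUMENTS not covered by returned buckets
sum_other = len(logs) - sum(c for _, c in top_buckets)  # 5
```

So ~77k is unique IPs, while 147,144,334 is log entries whose IPs didn't make it into the top-5000 buckets; there's no reason for those to be close.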
Thank you, that helped a lot; I think I've got it. Just for future reference, this seems to work. It's just test code, so it's not pretty, but you'll get the idea.
import math
from elasticsearch_dsl import A, Search, connections

def query_test(self, es_host, es_index, es_field, startDate, endDate):
    # define a default Elasticsearch client
    client = connections.create_connection(hosts=[es_host], timeout=90)

    # first, estimate the number of unique terms with a cardinality agg
    a2 = A('cardinality', field=es_field)
    s1 = Search(using=client, index=es_index).extra(size=0).filter(
        'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
    s1.aggs.bucket('cardinality', a2)
    s1_results = s1.execute()
    print(s1_results)
    cardinality = s1_results.aggregations.cardinality.value
    print("Cardinality:", cardinality)

    # work out how many partitioned terms queries we need
    # at roughly 9000 terms per partition
    partitions = cardinality / 9000
    print("Partitions Float:", partitions)
    partitions = math.ceil(partitions)
    print("Partitions Rounded:", partitions)

    itemTotal = 0
    dataDict = {}
    i = 0
    while i < partitions:
        s = Search(using=client, index=es_index).extra(size=0).filter(
            'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
        a = A('terms', field=es_field, size=99999999,
              include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('category_terms', a)
        s_results = s.execute()
        for item in s_results.aggregations.category_terms.buckets:
            # print(item['key'], ":", item['doc_count'])
            dataDict[item['key']] = item['doc_count']
        i = i + 1
        s_dict = s_results.to_dict()
        print(" Query Iterator:", i, " Total:",
              len(s_dict['aggregations']['category_terms']['buckets']))
        itemTotal += len(s_dict['aggregations']['category_terms']['buckets'])

    print("Total Items:", itemTotal)
    print("Items in Data Dict:", len(dataDict.keys()))
    for item in dataDict:
        # do work here
        print(item, ":", dataDict[item])
This works, and I'm getting the results I should get. It seems to work fine on fields with text in them. But I have a field that is a number (a port number), such as:
21 : FTP
80 : HTTP
etc. It crashes on this numerical field. I tried src_port.keyword; that doesn't work either. Is the best approach just to re-map that field to text?
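One note on why src_port.keyword fails: the .keyword sub-field only exists if the mapping defines it, and dynamic mapping adds it for text fields, not numeric ones. If re-mapping is the route taken, the field doesn't have to give up its numeric type; a multi-field mapping can keep the port as an integer (for range queries) and add a keyword sub-field for the terms/partition agg. A sketch of such a mapping body (the field name src_port matches the post; everything else is an assumption, and existing documents would need reindexing):

```python
# hypothetical index mapping: src_port stays numeric, with a keyword
# sub-field that the partitioned terms aggregation can target
mapping = {
    "mappings": {
        "properties": {
            "src_port": {
                "type": "integer",
                "fields": {
                    "keyword": {"type": "keyword"}
                }
            }
        }
    }
}
# after reindexing, aggregate on "src_port.keyword"
```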
Thank you. So you use the "scan" to see how many buckets you will need, then use that calculation to partition? How is that different from cardinality?
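For future readers: a composite aggregation is another way to page through every bucket without estimating the count up front. Each page returns an after_key that you feed into the next request, so no cardinality/partition arithmetic is needed. A minimal sketch of the request bodies as plain dicts (field name and page size are assumptions; requires an ES version that supports composite aggs):

```python
def composite_page_body(field, after_key=None, page_size=1000):
    """Body for one page of a composite aggregation over `field`."""
    comp = {
        "size": page_size,
        "sources": [{"by_ip": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        comp["after"] = after_key  # resume where the previous page stopped
    return {"size": 0, "aggs": {"unique_ips": {"composite": comp}}}

first = composite_page_body("src_ip")
# after running `first`, read aggregations.unique_ips.after_key and loop:
nxt = composite_page_body("src_ip", after_key={"by_ip": "10.0.0.9"})
```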