Python3 query with large results, best approach

Hi, I'm ingesting a huge amount of log data into ES each month.

I'm trying to create a Python script to build a report at the end of the month.

The first query I need to make is against the index to get all the unique IP addresses and a count of how many times each of those addresses hit our network. I have millions of log entries and probably 100,000 unique IP addresses.

The query listed below works, but only returns about 5000 results; if I increase the size above 5000, it returns 0.

What am I doing wrong? Is there a better approach?
from pprint import pprint

from elasticsearch import Elasticsearch


def search(self):
    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)

    es_body = {
        "size": 1,  # EVERY example says set this to 0 to only get agg results, but I get an error when I set this to 0.. WHY?
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                    "size": 5000
                }  # terms
            }  # by_ip
        },  # aggs
    }  # end body

    # this gets a rough estimate of how many records will be returned
    page = es.search(
        index=self.index,
        scroll='20m',
        body=es_body
    )
    pprint(page)

Sharing the error you are seeing would be helpful 🙂

When size is set to 0, I get this error:

elasticsearch.exceptions.RequestError: RequestError(400, 'action_request_validation_exception', 'Validation Failed: 1: [size] cannot be [0] in a scroll context;')

But it seems every example online has size set to 0.

My basic query:
    es_body = {
        "size": 0,
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                }  # terms
            }  # by_ip
        },  # aggs
    }  # end body

It does, yes, but you cannot set size to 0 when using scroll, as the error highlights.
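If you only need the aggregation results, you can drop the scroll parameter entirely; then size 0 is allowed. A minimal sketch, assuming the same es client and es_body from the snippet above:

    es_body["size"] = 0  # no hits needed, only the aggregation buckets
    page = es.search(index=self.index, body=es_body)  # note: no scroll param
    for bucket in page["aggregations"]["by_ip"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])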

Along the same lines, dealing with the same problem:
I'm trying to get the unique IP addresses and the count of how many times they occur in some logs, and it's a LOT of data. It's my understanding that the "cardinality" aggregation will give me an estimate of entries so I can calculate the paging I will need.

When I do this I get this result: Estimated Entries: 77352

When I run the same query with "terms" instead of "cardinality" I get a DRASTICALLY different number:
'doc_count_error_upper_bound': 361980,
'sum_other_doc_count': 147144334}}

What am I missing here? Shouldn't these be close to the same?

Check out https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate. The cardinality aggregation only returns an approximate count. Also note that sum_other_doc_count is a document count (the documents not covered by the returned buckets), not a count of unique terms, so the two numbers measure different things.
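If the estimate matters, you can trade memory for accuracy with the precision_threshold setting (it maxes out at 40000). A minimal sketch in the raw query DSL, with a placeholder field name:

    es_body = {
        "size": 0,
        "aggs": {
            "unique_ips": {
                "cardinality": {
                    "field": "src_ip",            # placeholder: your IP field
                    "precision_threshold": 40000  # maximum allowed value
                }
            }
        }
    }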

You might find this blog post on high cardinality terms aggregation optimisations useful. This forum topic may also help you, in particular the examples of aggregation partitioning being discussed.
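To make the partitioning concrete, here is a rough sketch of what it looks like in the raw query DSL (the field name and sizes are placeholders, not taken from your mapping):

    es_body = {
        "size": 0,
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": "src_ip",  # placeholder: your IP field
                    "include": {"partition": 0, "num_partitions": 20},
                    "size": 10000  # must cover the expected terms in one partition
                }
            }
        }
    }
    # run this once per partition (0..19) and merge the buckets client-side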

Thank you, that helped a lot. I think I've got it. Just for future reference, this seems to work. It's just test code, so it's not pretty, but you'll get the idea.

import math

from elasticsearch_dsl import A, Search, connections


def query_test(self, es_host, es_index, es_field, startDate, endDate):
    # define a default Elasticsearch client
    client = connections.create_connection(hosts=[es_host], timeout=90)

    # step 1: estimate the number of unique terms with a cardinality aggregation
    a2 = A('cardinality', field=es_field)
    s1 = Search(using=client, index=es_index).extra(size=0).filter(
        'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
    s1.aggs.bucket('cardinality', a2)
    s1_results = s1.execute()
    print(s1_results)
    cardinality = s1_results.aggregations.cardinality.value
    print("Cardinality:", cardinality)

    # step 2: size the partitions so each one holds roughly 9000 terms
    partitions = cardinality / 9000
    print("Partitions Float:", partitions)
    partitions = math.ceil(partitions)
    print("Partitions Rounded:", partitions)

    # step 3: run one partitioned terms aggregation per partition
    itemTotal = 0
    dataDict = {}
    i = 0
    while i < partitions:
        s = Search(using=client, index=es_index).extra(size=0).filter(
            'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
        a = A('terms', field=es_field, size=99999999,
              include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('category_terms', a)
        s_results = s.execute()

        for item in s_results.aggregations.category_terms.buckets:
            dataDict[item['key']] = item['doc_count']
        i = i + 1

        s_dict = s_results.to_dict()
        print("  Query Iterator:", i, " Total:",
              len(s_dict['aggregations']['category_terms']['buckets']))
        itemTotal += len(s_dict['aggregations']['category_terms']['buckets'])
    print("Total Items:", itemTotal)
    print("Items in Data Dict:", len(dataDict.keys()))

    for item in dataDict:
        # do work here
        print(item, ":", dataDict[item])

This works, and I'm getting the results I should get. It seems to work fine on fields with text in them. I have a field that is a number, a port number, such as:
21: FTP
80: HTTP

etc. It crashes with this numerical field. I tried src_port.keyword, and that doesn't work either. Is the best approach just to re-map that field to text?

thank you

Beware of https://www.elastic.co/guide/en/elasticsearch/reference/current/search-settings.html#search-settings-max-buckets. I think your search will be limited to that number of buckets, whatever your environment setting might be.
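If you want to check what your cluster's limit actually is, something like this should work (a sketch, assuming an elasticsearch-py client named es; an explicit override would appear under 'persistent' or 'transient' rather than 'defaults'):

    settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
    print(settings["defaults"].get("search.max_buckets"))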

I have some Python code that uses filtering queries and scan, then does the bucketing in Python.
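Not the actual code, but a minimal sketch of that pattern (placeholder host, index, and field names; assumes the elasticsearch-py helpers):

    from collections import Counter

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch(hosts=["localhost"])  # placeholder host
    counts = Counter()
    query = {"query": {"range": {"@timestamp": {"gte": "now-30d", "lte": "now"}}}}

    # stream every matching document and bucket client-side
    for hit in scan(es, index="logs-*", query=query, _source=["src_ip"]):
        counts[hit["_source"]["src_ip"]] += 1

    for ip, n in counts.most_common(10):
        print(ip, n)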

Thank you. So you use "scan" to see how many buckets you will need, then use that calculation to partition? How is that different than cardinality?
