Python3 query with large results, best approach

Hi, I'm ingesting a huge amount of log data into ES each month.

I'm trying to create a Python script to build a report at the end of the month.

The first query I need to make is against the index to get all the unique IP addresses and a count of how many times each of those addresses hit our network. I have millions of log entries and probably 100,000 unique IP addresses.

The query listed below works, but only returns about 5000 results; if I increase the size above 5000, it returns 0.

What am I doing wrong? Is there a better approach?
from pprint import pprint

from elasticsearch import Elasticsearch


def search(self):
    # create the ES client
    es = Elasticsearch(hosts=[self.host], timeout=60, max_retries=3, retry_on_timeout=True)

    es_body = {
        "size": 1,  # EVERY example says set this to 0 to only get agg results, but I get an error when I set this to 0.. WHY?
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                    "size": 5000
                }  # terms
            }  # by_ip
        },  # aggs
    }  # end body

    # this gets a rough estimate of how many records will be returned
    page = es.search(
        index=self.index,
        scroll='20m',
        body=es_body
    )
    pprint(page)

Sharing the error you are seeing would be helpful 🙂

When size is set to 0, I get this error:

elasticsearch.exceptions.RequestError: RequestError(400, 'action_request_validation_exception', 'Validation Failed: 1: [size] cannot be [0] in a scroll context;')

But it seems every example online has size set to 0.

My basic query:
    es_body = {
        "size": 0,
        "query": {
            "bool": {
                "must": {
                    "range": {"@timestamp": {"gte": self.start_date, "lte": self.end_Date}}
                }  # must
            }  # bool
        },  # query
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": self.field,
                }  # terms
            }  # by_ip
        },  # aggs
    }  # end body

It does, yes, but you cannot set size to 0 when using scroll, as the error highlights.
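If you only need the aggregation results, you can drop the scroll parameter entirely; then size 0 is allowed. A minimal sketch, assuming the same es client and es_body from the snippet above:

    es_body["size"] = 0  # no hits needed, only the aggregation buckets
    page = es.search(index=self.index, body=es_body)  # note: no scroll param
    for bucket in page["aggregations"]["by_ip"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])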

Along the same lines, dealing with the same problem:
I'm trying to get the unique IP addresses and the count of how many times they occur in some logs, and it's a LOT of data. It's my understanding that the "cardinality" aggregation will give me an estimate of entries so I can calculate the paging I will need.

When I do this I get this result: Estimated Entries: 77352

When I run the same query with "terms" instead of "cardinality" I get a DRASTICALLY different number:
'doc_count_error_upper_bound': 361980,
'sum_other_doc_count': 147144334}}

What am I missing here? Shouldn't these be close to the same?

Check out https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate. The cardinality aggregation only returns an approximate count. Also note that sum_other_doc_count is a document count (the documents not covered by the returned buckets), not a count of unique terms, so the two numbers measure different things.
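If the estimate matters, you can trade memory for accuracy with the precision_threshold setting (it maxes out at 40000). A minimal sketch in the raw query DSL, with a placeholder field name:

    es_body = {
        "size": 0,
        "aggs": {
            "unique_ips": {
                "cardinality": {
                    "field": "src_ip",            # placeholder: your IP field
                    "precision_threshold": 40000  # maximum allowed value
                }
            }
        }
    }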

You might find this blog post on high cardinality terms aggregation optimisations useful. This forum topic may also help you, in particular the examples of aggregation partitioning being discussed.
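To make the partitioning concrete, here is a rough sketch of what it looks like in the raw query DSL (the field name and sizes are placeholders, not taken from your mapping):

    es_body = {
        "size": 0,
        "aggs": {
            "by_ip": {
                "terms": {
                    "field": "src_ip",  # placeholder: your IP field
                    "include": {"partition": 0, "num_partitions": 20},
                    "size": 10000  # must cover the expected terms in one partition
                }
            }
        }
    }
    # run this once per partition (0..19) and merge the buckets client-side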

Thank you, that helped a lot. I think I've got it. Just for future reference, this seems to work. It's just test code, so it's not pretty, but you'll get the idea.

import math

from elasticsearch_dsl import A, Search, connections


def query_test(self, es_host, es_index, es_field, startDate, endDate):
    # define a default Elasticsearch client
    client = connections.create_connection(hosts=[es_host], timeout=90)

    # step 1: estimate the number of unique terms with a cardinality aggregation
    a2 = A('cardinality', field=es_field)
    s1 = Search(using=client, index=es_index).extra(size=0).filter(
        'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
    s1.aggs.bucket('cardinality', a2)
    s1_results = s1.execute()
    print(s1_results)
    cardinality = s1_results.aggregations.cardinality.value
    print("Cardinality:", cardinality)

    # step 2: size the partitions so each one holds roughly 9000 terms
    partitions = cardinality / 9000
    print("Partitions Float:", partitions)
    partitions = math.ceil(partitions)
    print("Partitions Rounded:", partitions)

    # step 3: run one partitioned terms aggregation per partition
    itemTotal = 0
    dataDict = {}
    i = 0
    while i < partitions:
        s = Search(using=client, index=es_index).extra(size=0).filter(
            'range', **{'@timestamp': {'gte': startDate, 'lt': endDate}})
        a = A('terms', field=es_field, size=99999999,
              include={"partition": i, "num_partitions": partitions})
        s.aggs.bucket('category_terms', a)
        s_results = s.execute()

        for item in s_results.aggregations.category_terms.buckets:
            dataDict[item['key']] = item['doc_count']
        i = i + 1

        s_dict = s_results.to_dict()
        print("  Query Iterator:", i, " Total:",
              len(s_dict['aggregations']['category_terms']['buckets']))
        itemTotal += len(s_dict['aggregations']['category_terms']['buckets'])
    print("Total Items:", itemTotal)
    print("Items in Data Dict:", len(dataDict.keys()))

    for item in dataDict:
        # do work here
        print(item, ":", dataDict[item])

This works, and I'm getting the results I should get. It seems to work fine on fields with text in them. I have a field that is a number, a port number, such as:
21: FTP
80: HTTP

etc. It crashes with this numerical field. I tried src_port.keyword, and that doesn't work either. Is the best approach just to re-map that field to text?

thank you

Beware of https://www.elastic.co/guide/en/elasticsearch/reference/current/search-settings.html#search-settings-max-buckets. I think your search will be limited to that number of buckets, whatever your environment setting might be.
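If you want to check what your cluster's limit actually is, something like this should work (a sketch, assuming an elasticsearch-py client named es; an explicit override would appear under 'persistent' or 'transient' rather than 'defaults'):

    settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
    print(settings["defaults"].get("search.max_buckets"))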

I have some Python code that uses filtering queries and scan, then does the bucketing in Python.
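Not the actual code, but a minimal sketch of that pattern (placeholder host, index, and field names; assumes the elasticsearch-py helpers):

    from collections import Counter

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch(hosts=["localhost"])  # placeholder host
    counts = Counter()
    query = {"query": {"range": {"@timestamp": {"gte": "now-30d", "lte": "now"}}}}

    # stream every matching document and bucket client-side
    for hit in scan(es, index="logs-*", query=query, _source=["src_ip"]):
        counts[hit["_source"]["src_ip"]] += 1

    for ip, n in counts.most_common(10):
        print(ip, n)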

Thank you. So you use "scan" to see how many buckets you will need, then use that calculation to partition? How is that different than cardinality?
