What is the fastest way to read all the records in an index?

New to the list, semi-new to ELK. Apologies if this is a FAQ, but I can't find an answer.

I'm processing proxy logs. Logstash does just fine parsing them and putting them into daily logstash-[date] indices.

However, I want to programmatically walk through all the records in each day's index and collapse some of the data into historical summaries. The processing is beyond what I can do in an Elasticsearch query. I have a Python program that performs this operation, but its performance isn't adequate.

The biggest bottleneck is reading the records from the logstash-[date] indices. I'm using the elasticsearch-py interface, which provides a wrapper for the 'scroll' API. Here are the relevant lines:

import elasticsearch
import elasticsearch.helpers as helpers

# Set up a global ES handle that retries on timeouts.
es = elasticsearch.Elasticsearch(retry_on_timeout=True)

# Main processing loop: scroll through every document in one index.
def process_index(index_name):
    global es
    # helpers.scan expects the body as a dict; "size" here controls
    # how many documents come back per scroll round trip.
    query_body = {"size": 10000, "query": {"match_all": {}}}
    scan_resp = helpers.scan(client=es, query=query_body, scroll="5m",
                             index=index_name, timeout="5m")
    for resp in scan_resp:
        # ... do stuff for one record ...
        pass

The Bulk API appears to be useful only for writing, not reading. I'm using it to write out the collapsed data I generate.
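For reference, here's a minimal sketch of that write pattern using the helpers.bulk wrapper; the index name, the "summary" doc type, and the document fields are placeholders, not my actual schema:

import elasticsearch
import elasticsearch.helpers as helpers

es = elasticsearch.Elasticsearch(retry_on_timeout=True)

def write_summaries(summaries, index_name):
    # One bulk action per summary document; "_index", "_type", and
    # the fields inside each doc are all placeholders here.
    actions = ({"_index": index_name,
                "_type": "summary",
                "_source": doc}
               for doc in summaries)
    helpers.bulk(es, actions)

# e.g.:
# write_summaries([{"host": "proxy01", "hits": 1234}], "proxy-summary-2015.06.01")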

But is there any way I can read all the records from an index with higher speed? At the moment, even if I null out the processing, I'm reading only about 2500 records/second. When my daily logs have 60 million entries, that works out to more than six hours just to read one day's index, which is painfully slow.

thanks in advance,

Peter Trei

Scroll is what you want. Have you tried reducing the size to see if that helps performance?
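For example, a rough sketch to time a few batch sizes via scan's size kwarg (the candidate sizes and the index name are just examples to benchmark against your own data):

import time
import elasticsearch
import elasticsearch.helpers as helpers

es = elasticsearch.Elasticsearch(retry_on_timeout=True)

# Try a few per-batch sizes and time a full pass over one index.
for batch_size in (500, 1000, 5000, 10000):
    start = time.time()
    count = 0
    hits = helpers.scan(client=es,
                        query={"query": {"match_all": {}}},
                        scroll="5m",
                        size=batch_size,  # docs returned per scroll request
                        index="logstash-2015.06.01")
    for hit in hits:
        count += 1
    print("size %d: %d docs in %.1fs" % (batch_size, count, time.time() - start))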

@ptrei Just curious: what is the configuration of your server?

I am also running into the same issue and I am seeing even worse performance. It would be helpful if you could share your current configuration.