What is the fastest way to read all the records in an index?


(Peter Trei) #1

New to the list, semi-new to ELK. Apologies if this is a FAQ, but I can't find an answer.

I'm processing proxy logs. Logstash does just fine in parsing them, and putting them into daily logstash-[date] indices.

However, I want to programmatically walk through all the records in each day's index, and collapse some of the data into historical summaries. The processing is beyond what I can do in an Elasticsearch query. I have a python program which performs this operation, but the performance isn't adequate.

The biggest bottleneck is reading the records from the logstash-[date] indices. I'm using the elasticsearch-py interface, which provides a wrapper for the 'scroll' API. Here are the relevant lines:

import elasticsearch
import elasticsearch.helpers as helpers

# Set up the global ES handle.
es = elasticsearch.Elasticsearch(retry_on_timeout=True)

# Main processing loop.
def process_index(index_name):
   query_body = '{"size": 10000, "query": {"match_all":{}}}'
   scanResp = helpers.scan(client=es, query=query_body, scroll="5m", index=index_name, timeout="5m")
   for resp in scanResp:
      pass  # DO STUFF FOR ONE RECORD

The Bulk API appears to be useful only for writing, not reading. I'm using it to write out the collapsed data I generate.

But is there any way I can read all the records from an index at a higher speed? At the moment, even if I null out the processing, I'm reading only about 2500 records/second. When my daily logs have 60 million entries, that's painfully slow.

thanks in advance,

Peter Trei


(Mark Walkom) #2

Scroll is what you want. Have you tried reducing the size to see if that helps performance?
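
One way to try this is to time the same scan at a few different batch sizes with the per-record processing nulled out. What follows is only a rough sketch, not a tuned benchmark: the helper names measure_throughput and compare_scroll_sizes, the list of sizes, and the 50000-record sample limit are assumptions made up for illustration. It relies on the size argument of helpers.scan in elasticsearch-py, and it passes the query as a dict rather than a JSON string.

```python
import time
from itertools import islice


def measure_throughput(records, limit=50000):
    """Consume up to `limit` items from `records`; return (count, records/second)."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(records, limit))
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def compare_scroll_sizes(es, index_name, sizes=(500, 1000, 5000, 10000), limit=50000):
    """Time helpers.scan at several scroll batch sizes (hypothetical helper; needs a live cluster)."""
    # Imported lazily so measure_throughput can be used without the client installed.
    from elasticsearch import helpers

    results = {}
    for size in sizes:
        hits = helpers.scan(
            client=es,
            query={"query": {"match_all": {}}},
            index=index_name,
            scroll="5m",
            size=size,  # scroll batch size under test
        )
        # Keep only the records/second figure for this batch size.
        results[size] = measure_throughput(hits, limit)[1]
    return results
```

Running compare_scroll_sizes(es, index_name) against one daily index returns a dict mapping each size to the observed records/second, which should show whether a smaller batch actually helps on a given cluster. The first pass may be slower than later ones because of caching, so it is worth repeating the run before drawing conclusions.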


(Amit Pandita) #3

@ptrei Just curious: what is the configuration of your server?

I am running into the same issue and seeing even worse performance. It would be helpful if you could share your current configuration.

