New to the list, semi-new to ELK. Apologies if this is a FAQ, but I can't find an answer.
I'm processing proxy logs. Logstash does just fine in parsing them, and putting them into daily logstash-[date] indices.
However, I want to programmatically walk through all the records in each day's index, and collapse some of the data into historical summaries. The processing is beyond what I can do in an Elasticsearch query. I have a python program which performs this operation, but the performance isn't adequate.
The biggest bottleneck is reading the records from the logstash-[date] indices. I'm using the elasticsearch-py interface, which provides a wrapper for the 'scroll' API. Here are the relevant lines:
import elasticsearch
import elasticsearch.helpers as helpers
es = elasticsearch.Elasticsearch(retry_on_timeout=True)
# sets up global ES handle
#main processing loop
def process_index(index_name):
    global es
    # match_all query; the page size goes to scan() itself, not the query body
    query_body = {"query": {"match_all": {}}}
    scanResp = helpers.scan(client=es, query=query_body, scroll="5m",
                            index=index_name, size=10000, timeout="5m")
    for resp in scanResp:
        # DO STUFF FOR ONE RECORD
        pass
The Bulk API appears to be useful only for writing, not reading. I'm using it to write out the collapsed data I generate.
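For reference, the write side with helpers.bulk can be kept simple by feeding it a generator of actions; the index name and summary fields below are illustrative, not from my actual data:

```python
def summary_actions(index_name, summaries):
    # Turn each summary dict into a bulk "index" action.
    for s in summaries:
        yield {"_index": index_name, "_source": s}

def write_summaries(es, index_name, summaries):
    # helpers.bulk batches the actions into _bulk requests for us.
    import elasticsearch.helpers as helpers
    helpers.bulk(es, summary_actions(index_name, summaries))
```

Because the actions are generated lazily, the summaries never need to be held in memory all at once.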
But is there any way I can read all the records from an index with higher speed? At the moment, even if I null out the processing, I'm reading about 2,500 records/second. When my daily logs have 60 million entries, that's nearly seven hours per day of data, which is painfully slow.
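One approach I've been considering, assuming the cluster supports sliced scrolls (Elasticsearch 5.x and later), is to split the scan into disjoint slices and read them from parallel worker processes. This is only a sketch; the worker count and per-record processing are placeholders:

```python
import multiprocessing

def slice_query(slice_id, max_slices):
    # Each worker scans a disjoint slice of the index.
    return {"slice": {"id": slice_id, "max": max_slices},
            "query": {"match_all": {}}}

def scan_slice(args):
    index_name, slice_id, max_slices = args
    import elasticsearch
    import elasticsearch.helpers as helpers
    # One client per process; clients are not fork-safe.
    es = elasticsearch.Elasticsearch(retry_on_timeout=True)
    count = 0
    for hit in helpers.scan(client=es,
                            query=slice_query(slice_id, max_slices),
                            scroll="5m", index=index_name, size=10000):
        count += 1  # replace with real per-record processing
    return count

def parallel_scan(index_name, workers=4):
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(scan_slice,
                            [(index_name, i, workers) for i in range(workers)]))
```

Each slice is scrolled independently, so the read throughput should scale with the worker count up to whatever the cluster's shards can serve.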
thanks in advance,
Peter Trei