Best approach to summarize data?

We need to summarize data from a raw index. We can think of two approaches:

  1. Write a program in any supported language (Java, Python, etc.) using the exposed APIs and schedule it through cron.
  2. Write a search_template in Mustache, or a script in Groovy (or another supported language) under config/scripts, and trigger it using cron.
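For approach 1, here is a minimal sketch of the request body a cron-scheduled script would send through the Python (or Java) client before re-indexing the buckets. The index and field names (`logs-raw`, `@timestamp`, `latency_ms`) are hypothetical; adjust them to your mapping.

```python
import json

# Hypothetical source index and fields -- adjust to your own mapping.
SOURCE_INDEX = "logs-raw"

# Aggregation body: bucket the last hour of log lines into 5-minute
# intervals and compute an average per bucket. A cron-scheduled script
# would send this via the official client and write each bucket into a
# summary index.
body = {
    "size": 0,  # we only need the aggregation, not the raw hits
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "per_5m": {
            "date_histogram": {"field": "@timestamp", "interval": "5m"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}

print(json.dumps(body, indent=2))
```

The same body could equally be stored as a search_template (approach 2); the difference is mainly where the scheduling and re-indexing logic lives.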

Could you please suggest which approach is better in terms of performance?

Also, please suggest any other, better approaches.



Can you elaborate on why?
ES does aggs natively, plus with best_compression in 2.0 you should be in a good place in terms of storage.
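For reference, the best_compression codec mentioned above is a one-line static index setting on 2.0+; a sketch (the index name in the curl example is hypothetical, and the setting must be applied at creation time or on a closed index):

```python
import json

# Static index setting for ES 2.0+: best_compression switches stored
# fields from LZ4 to DEFLATE, trading a little CPU for less disk.
settings = {"settings": {"index": {"codec": "best_compression"}}}

# e.g.  curl -XPUT localhost:9200/logs-2015.11.01 -d '<this body>'
print(json.dumps(settings))
```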

What are you trying to summarize? Index size?

Thanks for your response!

As of today we are indexing about 150 GB (approx. 1 billion log lines) per day. This might go up to 500 GB per day in a few years' time. We don't want our queries to scan all these lines, so we are looking to summarize the data at regular intervals (such as 5m, 1h).

Please let us know your suggestions!

Which type of summary are you planning to run? Can you give some examples for the numbers you want out of this?

I think with your data being time-based you might want to check Time-Based Data | Elasticsearch: The Definitive Guide [2.x] | Elastic for recommendations on how to deal with that particular data type. If you know roughly which time frames your future queries will cover, you should be able to come up with an index design that lets you use standard Elasticsearch queries and aggregations while limiting the amount of data scanned per request. Essentially it boils down to storing your log lines in separate indexes, one per time frame, and defining those time frames sensibly.
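To make the one-index-per-time-frame idea concrete, here is a small sketch assuming a hypothetical `logs-` prefix and one index per day. A query over the last three days then only has to name (and scan) three indices instead of the whole data set:

```python
from datetime import date, timedelta

# Hypothetical naming scheme: one index per day, e.g. logs-2015.11.01.
def daily_indices(start, end, prefix="logs-"):
    """Return the comma-separated index list covering [start, end]."""
    names = []
    d = start
    while d <= end:
        names.append(prefix + d.strftime("%Y.%m.%d"))
        d += timedelta(days=1)
    return ",".join(names)

# A "last 3 days" query only touches 3 daily indices:
print(daily_indices(date(2015, 11, 1), date(2015, 11, 3)))
# -> logs-2015.11.01,logs-2015.11.02,logs-2015.11.03
```

In practice you would pass that string as the index part of the search URL (or use a wildcard like `logs-2015.11.*`), so Elasticsearch never opens the other indices at all.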

Hope this helps,

Assume this use case: a KPI/metric has to be displayed at 5-minute intervals (for the last hour), hourly intervals (for the last 24 hours), and daily intervals (for the last 30 days). In this case (especially for the last 24 hours and last 30 days), how are time-based indexes going to help? Won't the searches still scan all the log lines?

Data summarization is a very handy feature for use cases like these.
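One way to get that summarization today is to roll the raw data up yourself: a cron job (either approach from the first post) writes one pre-computed document per bucket into a small summary index. A sketch with hypothetical names (`kpi-summary-5m`, `avg_latency_ms`): the 30-day dashboard then aggregates roughly 8,640 small docs instead of re-scanning a billion raw log lines.

```python
import json

# Hypothetical rollup document written by a scheduled job, one per
# 5-minute bucket, into a summary index such as "kpi-summary-5m".
def summary_doc(metric, bucket_start, doc_count, value):
    return {
        # Deterministic _id so re-running the same bucket overwrites
        # the summary instead of duplicating it.
        "_id": "%s-%s" % (metric, bucket_start),
        "metric": metric,
        "timestamp": bucket_start,
        "interval": "5m",
        "doc_count": doc_count,
        "value": value,
    }

doc = summary_doc("avg_latency_ms", "2015-11-01T00:00:00Z", 520000, 41.7)
print(json.dumps(doc, indent=2))
```

Hourly and daily views can either aggregate these 5-minute docs on the fly (cheap, since there are few of them) or be rolled up again into their own summary indices.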