Best approach to summarize data?


(Venkataraman K S) #1

We need to summarize data from a raw index. We are considering two approaches:

  1. Write a program in any supported language (Java, Python, etc.) using the exposed APIs, and schedule it through cron.
  2. Write a search_template in Mustache, or a script in Groovy (or another supported scripting language), in config/scripts, and trigger it using cron.
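
As a rough sketch of what approach 1 might involve (not a definitive implementation): once the scheduled program has fetched time-bucketed counts, it still has to turn them into summary documents and write them back. The Python below assumes the buckets look like `date_histogram` buckets (`key` in epoch milliseconds plus `doc_count`). The function names, the `summary-5m` index name, and the document fields are all hypothetical.

```python
import json
from datetime import datetime, timezone


def buckets_to_summary_docs(buckets, metric, interval):
    """Turn histogram buckets into one summary document per bucket."""
    docs = []
    for b in buckets:
        start = datetime.fromtimestamp(b["key"] / 1000, tz=timezone.utc)
        docs.append({
            # Deterministic ID: re-running the cron job overwrites the
            # same summary document instead of creating a duplicate.
            "_id": f"{metric}-{interval}-{b['key']}",
            "metric": metric,
            "interval": interval,
            "bucket_start": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "count": b["doc_count"],
        })
    return docs


def to_bulk_payload(docs, index):
    """Serialize documents as newline-delimited JSON for a bulk request."""
    lines = []
    for doc in docs:
        source = {k: v for k, v in doc.items() if k != "_id"}
        # Assumption: depending on the Elasticsearch version, the action
        # line may also need a "_type" entry.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["_id"]}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


sample = [{"key": 1446336000000, "doc_count": 42}]
docs = buckets_to_summary_docs(sample, "errors", "5m")
print(to_bulk_payload(docs, "summary-5m"))
```

The deterministic `_id` is the main design choice here: a cron job that runs twice for the same window, or is re-run after a failure, should not inflate the summary.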

Could you please suggest which approach is better in terms of performance?

Also, please suggest any better approaches we may have missed.

Thanks


(Mark Walkom) #2

Can you elaborate on why?
ES does aggs natively, plus with best_compression in 2.0 you should be in a good place in terms of storage.
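
To make "aggs natively" a bit more concrete, here is a minimal sketch of a search body that asks Elasticsearch for per-interval counts instead of raw hits. The helper name and the `@timestamp` field are assumptions for illustration. The `interval` key matches the 2.x-era `date_histogram` syntax; newer releases use `fixed_interval` or `calendar_interval` instead.

```python
import json


def histogram_request(time_field, interval, start, end):
    """Build a search body that returns only time-bucketed counts."""
    return {
        # size 0: skip the hits, return only the aggregation results.
        "size": 0,
        # Restrict the aggregation to the requested time window.
        "query": {"range": {time_field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": time_field, "interval": interval}
            }
        },
    }


# Hypothetical usage: 5-minute counts for the last hour.
body = histogram_request("@timestamp", "5m", "now-1h", "now")
print(json.dumps(body, indent=2))
```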


(Doug Turnbull) #3

What are you trying to summarize? Index size?


(Venkataraman K S) #4

Thanks for your response!

As of today we are looking at indexing 150 GB (approx. 1 billion log lines) per day. This might go up to 500 GB per day in a few years' time. We don't want our queries to scan all these lines, so we are looking to summarize data at regular intervals (such as 5m, 1h).

Please let us know your suggestions!


(Isabel Drost-Fromm) #5

Which type of summary are you planning to run? Can you give some examples for the numbers you want out of this?

I think with your data being time based you might want to check https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html for recommendations on how to deal with that particular data type. If you know roughly which time frames you will want to cover in your future queries, you should be able to come up with an index design that lets you use standard Elasticsearch queries and aggregations while limiting the amount of data scanned by each request. Essentially it boils down to storing your log lines in separate indexes, one per time frame, and defining those time frames sensibly.
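
As an illustration of the one-index-per-time-frame idea, the sketch below assumes daily indexes named with a hypothetical `logs-` prefix followed by `YYYY.MM.DD`, and works out which index names a query for a given date range would need to touch. The helper name and the naming scheme are assumptions, not something prescribed by Elasticsearch.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """Names of the daily indexes covering start..end, both inclusive."""
    days = (end - start).days
    return [
        f"{prefix}{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


# Hypothetical usage: a query spanning three days only needs three indexes.
names = daily_indices("logs-", date(2015, 10, 30), date(2015, 11, 1))
print(",".join(names))
```

A search can then be sent to just those names (comma-separated in the request path), so indexes outside the time frame are never opened for that request.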

Hope this helps,
Isabel


(Mahesh) #6

Assume this use case: a KPI/metric has to be displayed at 5-minute intervals (for the last hour), hourly intervals (for the last 24 hours), and daily intervals (for the last 30 days). In this case (especially for the last 24 hours and the last 30 days), how are time-based indexes going to help? Won't the searches still scan all the log lines?

Data summarization is a very handy feature for use cases like these.
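
To put rough numbers on the use case above, here is a back-of-the-envelope sketch. It takes the approximate figure of 1 billion log lines per day mentioned earlier in the thread and assumes, purely for illustration, that logs arrive at a uniform rate. The names `VIEWS` and `view_costs` are hypothetical.

```python
RAW_LINES_PER_DAY = 1_000_000_000  # approximate figure quoted in the thread

VIEWS = [
    # (label, window in minutes, bucket size in minutes)
    ("last hour", 60, 5),
    ("last 24 hours", 24 * 60, 60),
    ("last 30 days", 30 * 24 * 60, 24 * 60),
]


def view_costs(views, raw_lines_per_day):
    """For each view: points on the chart vs. raw lines inside the window."""
    rows = []
    for label, window_min, bucket_min in views:
        buckets = window_min // bucket_min
        # Assumes a uniform log rate across the day.
        raw_lines = raw_lines_per_day * window_min // (24 * 60)
        rows.append((label, buckets, raw_lines))
    return rows


for label, buckets, raw_lines in view_costs(VIEWS, RAW_LINES_PER_DAY):
    print(f"{label}: {buckets} chart points from ~{raw_lines:,} raw lines")
```

Each view draws only a few dozen points, while the number of raw lines inside the window grows from tens of millions to tens of billions. That gap is the case for pre-computed summaries on the longer windows.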

