Best approach to summarize data?

We need to summarize data from a raw index. We can think of two approaches:

  1. Write a program in any supported language (Java, Python, etc.) using the exposed APIs and schedule it through cron.
  2. Write a search_template in Mustache, or a script in Groovy (or another supported language) under config/scripts, and trigger it using cron.
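For approach 1, here is a minimal sketch of the request body a cron-scheduled script would send through the Python (or Java) client before re-indexing the buckets. The index and field names (`logs-raw`, `@timestamp`, `latency_ms`) are hypothetical; adjust them to your mapping.

```python
import json

# Hypothetical source index and fields -- adjust to your own mapping.
SOURCE_INDEX = "logs-raw"

# Aggregation body: bucket the last hour of log lines into 5-minute
# intervals and compute an average per bucket. A cron-scheduled script
# would send this via the official client and write each bucket into a
# summary index.
body = {
    "size": 0,  # we only need the aggregation, not the raw hits
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "per_5m": {
            "date_histogram": {"field": "@timestamp", "interval": "5m"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}

print(json.dumps(body, indent=2))
```

The same body could equally be stored as a search_template (approach 2); the difference is mainly where the scheduling and re-indexing logic lives.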

Could you please suggest which approach is better in terms of performance?

Also, please suggest any other, better approaches.



Can you elaborate on why?
ES does aggs natively, plus with best_compression in 2.0 you should be in a good place in terms of storage.
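For reference, the best_compression codec mentioned above is a one-line static index setting on 2.0+; a sketch (the index name in the curl example is hypothetical, and the setting must be applied at creation time or on a closed index):

```python
import json

# Static index setting for ES 2.0+: best_compression switches stored
# fields from LZ4 to DEFLATE, trading a little CPU for less disk.
settings = {"settings": {"index": {"codec": "best_compression"}}}

# e.g.  curl -XPUT localhost:9200/logs-2015.11.01 -d '<this body>'
print(json.dumps(settings))
```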

What are you trying to summarize? Index size?

Thanks for your response!

As of today we are indexing about 150 GB (approx. 1 billion log lines) per day. This might go up to 500 GB per day in a few years' time. We don't want our queries to scan all these lines, so we are looking to summarize the data at regular intervals (such as 5m, 1h).

Please let us know your suggestions!

Which type of summary are you planning to run? Can you give some examples for the numbers you want out of this?

I think with your data being time-based you might want to check Time-Based Data | Elasticsearch: The Definitive Guide [2.x] | Elastic for recommendations on how to deal with that particular data type. If you know roughly which time frames your future queries will cover, you should be able to come up with an index design that lets you use standard Elasticsearch queries and aggregations while limiting the amount of data scanned per request. Essentially it boils down to storing your log lines in separate indexes, one per time frame, and defining those time frames sensibly.
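To make the one-index-per-time-frame idea concrete, here is a small sketch assuming a hypothetical `logs-` prefix and one index per day. A query over the last three days then only has to name (and scan) three indices instead of the whole data set:

```python
from datetime import date, timedelta

# Hypothetical naming scheme: one index per day, e.g. logs-2015.11.01.
def daily_indices(start, end, prefix="logs-"):
    """Return the comma-separated index list covering [start, end]."""
    names = []
    d = start
    while d <= end:
        names.append(prefix + d.strftime("%Y.%m.%d"))
        d += timedelta(days=1)
    return ",".join(names)

# A "last 3 days" query only touches 3 daily indices:
print(daily_indices(date(2015, 11, 1), date(2015, 11, 3)))
# -> logs-2015.11.01,logs-2015.11.02,logs-2015.11.03
```

In practice you would pass that string as the index part of the search URL (or use a wildcard like `logs-2015.11.*`), so Elasticsearch never opens the other indices at all.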

Hope this helps,

Assume this use case: a KPI/metric has to be displayed at 5-minute intervals (for the last hour), hourly intervals (for the last 24 hours), and daily intervals (for the last 30 days). In this case (especially for the last 24 hours and last 30 days), how are time-based indexes going to help? Won't the searches still scan all the log lines?

Data summarization is a very handy feature for use cases like these.
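One way to get that summarization today is to roll the raw data up yourself: a cron job (either approach from the first post) writes one pre-computed document per bucket into a small summary index. A sketch with hypothetical names (`kpi-summary-5m`, `avg_latency_ms`): the 30-day dashboard then aggregates roughly 8,640 small docs instead of re-scanning a billion raw log lines.

```python
import json

# Hypothetical rollup document written by a scheduled job, one per
# 5-minute bucket, into a summary index such as "kpi-summary-5m".
def summary_doc(metric, bucket_start, doc_count, value):
    return {
        # Deterministic _id so re-running the same bucket overwrites
        # the summary instead of duplicating it.
        "_id": "%s-%s" % (metric, bucket_start),
        "metric": metric,
        "timestamp": bucket_start,
        "interval": "5m",
        "doc_count": doc_count,
        "value": value,
    }

doc = summary_doc("avg_latency_ms", "2015-11-01T00:00:00Z", 520000, 41.7)
print(json.dumps(doc, indent=2))
```

Hourly and daily views can either aggregate these 5-minute docs on the fly (cheap, since there are few of them) or be rolled up again into their own summary indices.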