Hi,
We are currently using ES to analyze the last 24 hours of data. Data
arrives at a rate of a few hundred documents per 10-second interval, and
each document carries a timestamp.
We now need to analyze data over a full week. To reduce the amount of
space required, we plan to keep the 24-hour TTL on the raw documents but
roll the data up into one document per minute, and serve queries for data
older than 24 hours (and up to 7 days) from those roll-ups. All fields in
the document need to be aggregated.
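For example (field names here are placeholders, just to make "all fields
need to be aggregated" concrete), one per-minute roll-up document would
stand in for all of the raw documents of that minute:

```python
# A hypothetical per-minute roll-up document (field names are invented);
# it replaces the ~1-2 thousand raw documents that landed in that minute.
rollup_doc = {
    "timestamp": "2014-09-22T10:15:00Z",  # truncated to the minute
    "doc_count": 1800,                    # raw documents in this minute
    "metric_a_min": 12.0,                 # per-field aggregates: min/max/
    "metric_a_max": 98.0,                 # sum/avg for each numeric field
    "metric_a_sum": 123456.0,
    "metric_a_avg": 68.6,
}
```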
So:
Are there any out-of-the-box features we could use to achieve this kind
of roll-up?
What is the best approach (preferably a time-tested one, if someone has
already done this)?
Some approaches we were contemplating (a rough sketch follows the list):
1. Aggregate the data in real time (outside ES) and store the aggregated
data in ES.
2. Periodically (say, once every 30 minutes) run aggregation queries and
write the responses back to ES.
3. Periodically (say, once every 30 minutes) read the new documents with
a time-range query, aggregate them, and store the aggregated data back
into ES in bulk. Maybe use a streaming or paged (scan/scroll) read of the
documents to aggregate them.
4. A combination of 1 and (2 or 3), so that real-time data gets
aggregated as it arrives and data that is delayed (which may happen) can
be merged into the aggregates afterwards using the ES Update API?
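To make approaches 2 and 4 concrete, here is a minimal sketch of the
periodic job we have in mind, using the official Python client. The index
names, field names, and mapping are placeholders, and the
query/aggregation syntax is the 1.x-era form; treat it as an illustration,
not tested code. Approach 3 would replace the aggregation query with a
scan/scroll read and do the per-minute bucketing client-side.

```python
from datetime import datetime, timedelta

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # defaults to localhost:9200

RAW_INDEX = "events-raw"        # hypothetical raw-data index
ROLLUP_INDEX = "events-rollup"  # hypothetical roll-up index
WINDOW = timedelta(minutes=30)  # how far back each run looks


def rollup_window(end=None):
    """Roll the last WINDOW of raw documents up into one doc per minute."""
    end = end or datetime.utcnow()
    start = end - WINDOW

    # Approach 2: let ES do the bucketing with a date_histogram
    # aggregation plus a stats sub-aggregation per numeric field.
    body = {
        "size": 0,  # we only need the buckets, not the raw hits
        "query": {"range": {"timestamp": {"gte": start.isoformat() + "Z",
                                          "lt": end.isoformat() + "Z"}}},
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "timestamp", "interval": "1m"},
                "aggs": {"metric_a": {"stats": {"field": "metric_a"}}},
            }
        },
    }
    resp = es.search(index=RAW_INDEX, body=body)

    # One roll-up document per minute bucket. Using the bucket key as the
    # _id makes the job idempotent: re-running a window after late data
    # arrives simply overwrites the affected minutes, which covers
    # approach 4 without explicit Update API calls.
    actions = []
    for bucket in resp["aggregations"]["per_minute"]["buckets"]:
        stats = bucket["metric_a"]
        actions.append({
            "_index": ROLLUP_INDEX,
            "_type": "rollup",  # needed on 1.x, dropped in later majors
            "_id": bucket["key_as_string"],
            "_source": {
                "timestamp": bucket["key_as_string"],
                "doc_count": bucket["doc_count"],
                "metric_a_min": stats["min"],
                "metric_a_max": stats["max"],
                "metric_a_sum": stats["sum"],
                "metric_a_avg": stats["avg"],
            },
        })
    helpers.bulk(es, actions)
```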
On Tuesday, 23 September 2014 21:47:42 UTC+5:30, Otis Gospodnetic wrote:

We aggregate outside of ES, in memory, and push the results in bulk. We
could still roll up the data already stored in ES later on if we wanted
to, but reading it back out of ES could get expensive.
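In outline, that pipeline buckets incoming events by minute in memory,
folds each event into running aggregates, and flushes closed minutes to ES
in one bulk call. A minimal sketch, reusing the placeholder field names
from above (illustrative only, not production code):

```python
from collections import defaultdict

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
ROLLUP_INDEX = "events-rollup"  # hypothetical index name

# minute key -> running aggregate for that minute
buckets = defaultdict(lambda: {"doc_count": 0, "metric_a_sum": 0.0,
                               "metric_a_min": float("inf"),
                               "metric_a_max": float("-inf")})


def ingest(event):
    """Fold one raw event into the in-memory per-minute aggregate."""
    # Assumes an ISO timestamp like "2014-09-22T10:15:37Z"; [:16] keeps
    # everything up to the minute.
    minute = event["timestamp"][:16] + ":00Z"
    agg = buckets[minute]
    agg["doc_count"] += 1
    agg["metric_a_sum"] += event["metric_a"]
    agg["metric_a_min"] = min(agg["metric_a_min"], event["metric_a"])
    agg["metric_a_max"] = max(agg["metric_a_max"], event["metric_a"])


def flush():
    """Push all accumulated minutes to ES in one bulk request and reset."""
    # On 1.x each bulk action also needs a "_type".
    actions = [{"_index": ROLLUP_INDEX, "_id": minute,
                "_source": dict(agg, timestamp=minute)}
               for minute, agg in buckets.items()]
    helpers.bulk(es, actions)
    buckets.clear()
```

Flushing only minutes that are safely in the past keeps memory bounded;
events that arrive after their minute has been flushed are exactly the
late-data case the Update API idea above is meant to cover.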