There are a few components in the Elastic Stack whose differences I'm not sure I understand, nor when to use which.
For ML I currently use a datafeed with a predefined query (in practice it seems to aggregate metrics over a given time frame).
Rollup adds the ability to aggregate raw indices, which saves query time and space.
And in the latest version, after upgrading from 6.7 to 7.5, there is a new feature called 'data frames', which sounds similar.
So, if my goal is to aggregate data into smaller indices that can be queried and also used for ML jobs, which should I choose, and why are there several features with similar functionality?
I believe what you want is Transforms (the feature previously called data frame transforms), which allow you to convert existing Elasticsearch indices into summarized indices.
The other functionality you mention is very use-case specific and probably won't do what you need, based on your stated purpose. Datafeeds are only used for feeding data into ML anomaly detection jobs (and don't create other indices). Rollups are used to aggregate metric indices to reduce storage, and they come with a special _rollup_search endpoint that allows you to query across raw and summarized metric data.
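To make that a bit more concrete, here is a rough sketch of what a transform definition could look like, built as a plain Python dict and printed as the JSON you would send. The index names, field names and transform ID are all made up for illustration; only the overall shape (source, dest, pivot with group_by and aggregations) follows the transform API. In 7.5 the endpoint is _transform, while earlier 7.x releases used _data_frame/transforms, so check the docs for your version.

```python
import json

# Hypothetical names: "raw-orders", "customer-summary" and the fields below
# are placeholders, not anything from a real cluster.
transform_id = "customer-summary"

transform_body = {
    "source": {"index": "raw-orders"},       # existing raw index (assumed name)
    "dest": {"index": "customer-summary"},   # new summarized index (assumed name)
    "pivot": {
        # One output document per customer (the "entity").
        "group_by": {
            "customer_id": {"terms": {"field": "customer_id"}}
        },
        # Summary values computed for each customer.
        "aggregations": {
            "order_count": {"value_count": {"field": "order_id"}},
            "total_spent": {"sum": {"field": "amount"}},
            "last_seen": {"max": {"field": "@timestamp"}},
        },
    },
}


def transform_request(transform_id, body):
    """Return (method, path, json_payload) for creating the transform."""
    return "PUT", f"/_transform/{transform_id}", json.dumps(body, indent=2)


method, path, payload = transform_request(transform_id, transform_body)
print(method, path)
print(payload)
```

The destination index is an ordinary index, so it can be searched like any other index.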
On this specific point: rollups are a form of compaction geared specifically towards grouping documents by time unit, while transforms typically group documents by a chosen entity, such as a customer ID.
A web session is an example of an entity that can span time units, so time-based rollups are not an appropriate mechanism for grouping that information.
A monthly rollup is not grouped on a single entity key (it's a range of timestamps), so transforms are not appropriate.
What you might want to summarise for entities vs time groups might be similar (counts, flags, etc.), but the unit that you group things around is fundamentally different.
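A small illustration of the difference, in plain Python rather than Elasticsearch itself: the same three made-up events are grouped once by hour (rollup-style) and once by session ID (transform-style). The event data and function names are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Made-up events: session "s1" starts before 10:00 and continues after it.
events = [
    {"session": "s1", "ts": datetime(2020, 1, 1, 9, 50), "bytes": 100},
    {"session": "s1", "ts": datetime(2020, 1, 1, 10, 5), "bytes": 200},
    {"session": "s2", "ts": datetime(2020, 1, 1, 10, 20), "bytes": 50},
]


def summarise(events, key_fn):
    """Group events by key_fn(event) and keep a count and a byte total."""
    groups = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        group = groups[key_fn(event)]
        group["count"] += 1
        group["bytes"] += event["bytes"]
    return dict(groups)


# Time-based grouping: the key is the hour the event falls into.
hourly = summarise(
    events, lambda e: e["ts"].replace(minute=0, second=0, microsecond=0)
)

# Entity-based grouping: the key is the session the event belongs to.
per_session = summarise(events, lambda e: e["session"])

print(hourly)
print(per_session)
```

The summaries are the same kind of thing (a count and a sum), but session s1 is split across the 09:00 and 10:00 buckets in the hourly view and only appears as one unit in the per-session view.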
Yes, percentiles are on our roadmap, including support for functions that return multiple values. It is something we are keen to do, but we do not have committed timeframes for it yet.