Could you please help me understand the advantage of a data stream over a normal index, to which we can apply a template and an ILM policy so that it rolls over to a new index once it meets the conditions defined in the policy?
When creating a data stream we still need to define an ILM policy and a template. So what is the advantage of using a data stream over an index (to which we can also apply ILM and a template for rollover)?
In short, data streams are the "New / Go Forward" implementation for time series data. Any new Beat or Elastic Agent will create data streams.
You are right that you can get most of the same capabilities with Index + Alias + Template + ILM, but data streams abstract some of that away (particularly the whole write alias part).
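To make the "write alias part" concrete, here is a minimal sketch that only builds the request bodies for both approaches and sends nothing. The names `my-app-logs` and `my-policy` are made up for illustration, and the helper functions are hypothetical, not part of any client library.

```python
def alias_setup(prefix: str, policy: str):
    """Requests for the Index + Alias + Template + ILM pattern (non-DS)."""
    alias = prefix  # the write alias that clients index into
    template = {
        "index_patterns": [f"{prefix}-*"],
        "template": {
            "settings": {
                "index.lifecycle.name": policy,
                # ILM needs to know which alias to roll over
                "index.lifecycle.rollover_alias": alias,
            }
        },
    }
    # The first backing index must be bootstrapped by hand as the write index
    bootstrap = {"aliases": {alias: {"is_write_index": True}}}
    return [
        ("PUT", f"/_index_template/{prefix}", template),
        ("PUT", f"/{prefix}-000001", bootstrap),
    ]


def data_stream_setup(name: str, policy: str):
    """Requests for the data stream pattern: one template, no alias, no bootstrap."""
    template = {
        "index_patterns": [f"{name}*"],
        "data_stream": {},  # marks matching names as data streams
        "template": {"settings": {"index.lifecycle.name": policy}},
    }
    # The data stream itself is created on the first write to `name`
    return [("PUT", f"/_index_template/{name}", template)]


non_ds = alias_setup("my-app-logs", "my-policy")
ds = data_stream_setup("my-app-logs", "my-policy")
print(len(non_ds), "requests for non-DS,", len(ds), "request for DS")
```

The difference is small but real: with a data stream there is no rollover alias setting to keep in sync and no first index to bootstrap.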
Data streams encapsulate a long history of best practices for time series data.
Also, there are some optimizations under the covers with data streams, for searching etc., since data streams are by definition append-only.
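The append-only part shows up on the write side: documents sent to a data stream need an `@timestamp` field, and bulk requests use the `create` action rather than `index`. A rough sketch of building such a bulk body follows (the helper name and sample documents are made up, and nothing is sent to a cluster):

```python
import json


def bulk_create_body(docs):
    """Build an NDJSON _bulk body that appends docs using the `create` action."""
    lines = []
    for doc in docs:
        if "@timestamp" not in doc:
            # Data streams require a timestamp on every document
            raise ValueError("document is missing @timestamp")
        lines.append(json.dumps({"create": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects a trailing newline
    return "\n".join(lines) + "\n"


body = bulk_create_body([
    {"@timestamp": "2024-01-01T00:00:00Z", "message": "service started"},
    {"@timestamp": "2024-01-01T00:00:05Z", "message": "request handled"},
])
print(body)
```

The body would be posted to `/<data-stream-name>/_bulk`; the data stream routes each document to its current write index.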
Some other features are supported as well, such as TSDS (time series data streams) for metrics.
Here is a nice summary: (Credit to Bard ... but I looked it over)
1. Simplified Management of Time-Series Data:
- Automatic Index Management: Data streams automatically create, manage, and roll over underlying indices based on time or size, eliminating manual index creation and rotation tasks.
- Continuous Writes: You can continuously write data to a single logical stream without worrying about index boundaries or rollovers, simplifying ingestion pipelines for continuous data sources.
2. Optimized Performance for Time-Series Queries:
- Seamless Data Slicing: Data streams are optimized for time-based queries, allowing for efficient retrieval of data within specific time ranges without complex index management.
- Reduced Search Overhead: By targeting queries to specific time periods, data streams minimize the search scope, leading to faster query response times.
3. Streamlined Data Lifecycle Management:
- Integrated ILM: Data streams seamlessly integrate with Index Lifecycle Management (ILM) policies, enabling automatic data tiering, archiving, and deletion based on configurable rules, reducing storage costs and improving performance.
- Simplified Data Retention and Archiving: You can easily set up retention periods and move older data to cold storage or delete it based on ILM policies, ensuring efficient data management over time.
4. Enhanced Scalability and Availability:
- Shard Allocation Awareness: Data streams consider shard allocation when rolling over indices, helping to distribute data evenly across nodes for better performance and scalability.
- Improved Load Balancing: This distribution of data across nodes contributes to better load balancing and failover capabilities, reducing the impact of node failures and maintaining high availability.
5. Streamlined Data Ingestion:
- Centralized Write Target: Data streams provide a single logical endpoint for data ingestion, simplifying the process for continuous data sources like logs, metrics, and events.
- Continuous Data Flow: This eliminates the need to manage multiple indices manually, making it easier to ingest and manage large volumes of streaming data.
6. Integration with Data Tiering:
- Cost-Effective Storage: Data streams leverage data tiers (e.g., hot, warm, cold) to optimize storage costs and performance, allowing you to store less-frequently accessed data on less expensive hardware.
In general, data streams are ideal for managing time-series data in Elasticsearch, offering significant benefits in terms of simplified management, enhanced performance, streamlined data lifecycle, and improved scalability.
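For the ILM and data tiering points above, a policy along these lines is what does the work in either approach. This is only a sketch: the ages, shard size, and the helper name are arbitrary assumptions, and the dict would be the body of `PUT /_ilm/policy/<policy-name>`.

```python
def tiered_ilm_policy(rollover_age="7d", warm_after="7d",
                      cold_after="30d", delete_after="90d"):
    """Build an ILM policy body with hot, warm, cold, and delete phases."""
    return {
        "policy": {
            "phases": {
                # Hot: keep writing until the index is old or large enough
                "hot": {"actions": {"rollover": {
                    "max_age": rollover_age,
                    "max_primary_shard_size": "50gb",
                }}},
                # Warm and cold: min_age is counted from rollover; to my
                # understanding ILM migrates the index to the matching
                # data tier when it enters the phase
                "warm": {"min_age": warm_after,
                         "actions": {"set_priority": {"priority": 50}}},
                "cold": {"min_age": cold_after,
                         "actions": {"set_priority": {"priority": 0}}},
                # Delete: drop the index once retention has passed
                "delete": {"min_age": delete_after,
                           "actions": {"delete": {}}},
            }
        }
    }


policy = tiered_ilm_policy()
print(list(policy["policy"]["phases"]))
```

The same policy can be referenced from a data stream template or from a classic alias-based template via `index.lifecycle.name`.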
This helps us understand the overall benefits of DS; however, we still do not see how it scores over the "Index + Alias + Template + ILM" approach (let's call this non-DS). We think that is because the steps we perform seem to be the same for both approaches, except that we do not have to create the alias.
Automatic Index Management - we do not see it as "automatic", because we still need to create an ILM policy, which was the case with non-DS as well. Isn't it?
Streamlined Data Ingestion - in non-DS too, the alias acts as a single logical endpoint for data ingestion.
Continuous Data Flow - multiple indices do not need to be managed manually in non-DS either.
Perhaps Optimized Performance and Enhanced Scalability and Availability, which are behind-the-curtain benefits, are the main differences, and that is why we are not able to appreciate the additional benefits that DS brings over non-DS?
When working with the new data stream lifecycle, how would you specify the tier preference as the index ages?
Although being able to see the data retention per data stream in the UI in the new technical preview is great, as far as I can tell there is no way to actually specify lifecycle "stages" as with ILM's hot-warm-cold.
I think that is just the very first tech preview. When I look at it, it appears to be just Hot -> Delete. I suspect more will come in the future, but it is not clear to me. I would let this feature mature a bit.
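For reference, here is roughly what the data stream lifecycle request looks like as I understand the tech preview: a single retention value and no phases or tier settings. The helper and the names below are hypothetical, and the endpoint shape may change while the feature is in preview.

```python
def lifecycle_request(data_stream: str, retention: str):
    """Build the request that sets retention on an existing data stream."""
    # Only a retention period is specified; there is no hot/warm/cold section
    return (
        "PUT",
        f"/_data_stream/{data_stream}/_lifecycle",
        {"data_retention": retention},
    )


method, path, body = lifecycle_request("my-app-logs", "30d")
print(method, path, body)
```

Compare this with an ILM policy, where each phase carries its own `min_age` and actions. If you need tier movement today, ILM is still the place to define it.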