I'm brand new to the concept of data streams, and I'm trying to understand the concepts and limits behind them, so sorry in advance if my question is a dumb one.
Let's say:
I configure a data stream "foo" with a rollover of "max age = 1 hour", so every hour a new backing index is created, starting with foo-000001 today at 1:00 AM (see the configuration sketch at the end of this post).
I have two applications, APP1 and APP2, on two different servers, continuously logging data into Elasticsearch via their respective Filebeats, which collect the apps' log files; each application writes one log entry every second.
At 7:00 AM, the data stream's write index is foo-000007, and both applications continue to send logs every second.
At 7:58 AM, because of a temporary failure in part of my network, APP1's Filebeat is no longer able to reach Elasticsearch (while APP2 continues to write logs to the data stream).
At 8:00 AM, rollover happens and the data stream's new write index becomes foo-000008; APP2 continues to write logs to it while APP1 doesn't.
At 8:05 AM, the network issue ends. APP1's Filebeat starts sending logs to Elasticsearch again, beginning with the logs from 7:58 AM.
=> Given that:
the data stream's write backing index has meanwhile moved to foo-000008
APP2 has already filled the data stream with logs from 8:00 AM to 8:05 AM
data streams are "append-only time series data"
=> will Elasticsearch refuse to store the logs sent by APP1's Filebeat with an @timestamp between 7:58 AM and 8:05 AM? And will I thus lose all of APP1's logs between 7:58 AM and 8:05 AM?
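For reference, here is roughly how I configured the hourly rollover (a minimal sketch; "foo-policy" and "foo-template" are just names I made up):

```
# ILM policy that rolls the write index over every hour
PUT _ilm/policy/foo-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1h"
          }
        }
      }
    }
  }
}

# Index template that creates the "foo" data stream and attaches the policy
PUT _index_template/foo-template
{
  "index_patterns": ["foo"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "foo-policy"
    }
  }
}
```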
No, Elasticsearch will write the documents into whatever the current backing (write) index is. The timestamp of a document is not gated or checked upon writing. In general, documents do end up in a backing index that roughly corresponds to their time, but when there are lags, disruptions, or other delays in documents being written, they will simply be written into the current write index. There is no guarantee (nor any actual requirement) that a document's timestamp matches the backing index's rollover timing. There are also some "smarts" to help Elasticsearch optimize searching: it keeps track of the min/max timestamps in the backing indices.
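You can check this yourself: index a document with an "old" @timestamp into the data stream and look at which backing index it landed in (a minimal sketch; the timestamp and message values are made up, and it assumes your data stream foo already exists):

```
# Indexing into a data stream (defaults to op_type create)
POST foo/_doc
{
  "@timestamp": "2024-05-01T07:58:00Z",
  "message": "late log line from APP1"
}

# _index in the response hits shows the backing index the document was
# actually written to: the current write index, regardless of @timestamp
GET foo/_search?filter_path=hits.hits._index,hits.hits._source
```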
When you search the data, you will most likely be searching the data stream via a Data View with a time filter; again, there is logic to optimize the search, both what is searched and what is returned.
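For example, a time-filtered search against the data stream would look something like this (a sketch with made-up times); Elasticsearch can skip backing indices whose recorded min/max timestamps fall entirely outside the requested range:

```
GET foo/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2024-05-01T07:58:00Z",
        "lte": "2024-05-01T08:05:00Z"
      }
    }
  }
}
```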
With respect to "append-only", the docs say:
Append-only
Data streams are designed for use cases where existing data is rarely, if ever, updated. You cannot send update or deletion requests for existing documents directly to a data stream. Instead, use the update by query and delete by query APIs.
If needed, you can update or delete documents by submitting requests directly to the document’s backing index.
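So if you ever did need to remove documents, you would go through delete by query rather than a direct delete, roughly like this (a sketch only; the field and value are hypothetical):

```
POST foo/_delete_by_query
{
  "query": {
    "match": {
      "service.name": "APP1"
    }
  }
}
```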
OK. The more I think about it, the more it makes sense; otherwise, the same problem would arise for any document arriving a few milliseconds before or after any rollover.