Elasticsearch Data Streams: Update Strategies, Concerns, and Alternatives

Hi Team,
I am stuck and need your help!!
Use case:
I am using Elasticsearch, where I am storing activities in a data stream. I want to perform update operations on this data stream.
Time-based range queries are fired to fetch the data, and the number of records is quite high, nearly 1-2 crore (10-20 million).

Problem:
I am currently using Logstash for other processes, but since Logstash does not provide a way to update a data stream out of the box, it does not fit my use case, as I would need to update activities.
While reading, I found that Elasticsearch supports the Update By Query API for updating documents in a data stream.
Reference: How to update data stream?

Ask:

  1. Can I use this feature via the Elasticsearch client in my code, i.e. without using Logstash to update the data stream? Is that a correct way to use it? (A sketch of what I have in mind is included after this list.)
  2. Could it happen that, going forward, Elasticsearch removes this option of updating a data stream via the API?
  3. Is there an alternative way, in Logstash or in general, that I could solve this use case?

I just want fast retrieval from a large, time-based data set.
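For reference, here is a minimal sketch of what I have in mind with the Python client (the connection details, data stream name, and field values are illustrative):

```python
from elasticsearch import Elasticsearch

# Illustrative connection details; replace with your own cluster and auth.
es = Elasticsearch("https://localhost:9200", api_key="...")

# Update By Query on a data stream: documents can only be updated in place,
# so the script may change regular fields but cannot reroute the document
# to a different backing index.
resp = es.update_by_query(
    index="activities",  # the data stream name
    query={
        "bool": {
            "filter": [
                {"term": {"activityId": "a-123"}},
                {"range": {"activityTime": {"gte": "2023-01-01", "lt": "2023-02-01"}}},
            ]
        }
    },
    script={
        "source": "ctx._source.owner = params.owner",
        "params": {"owner": "new-owner"},
    },
    conflicts="proceed",  # skip (rather than abort on) version conflicts
)
print(resp["updated"], "documents updated")
```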

Data streams are optimised for immutable, append-only data, so if you need to update data I would recommend against using data streams.

It would help if you told us a bit about the use case.

  • What kind of data is it?
  • How long do you keep the data?
  • How often do you update the data?
  • How do you update the data? Are you replacing with a new record or just changing a few fields?
  • Do you know the timestamp associated with the initial event when you perform the update?

It would also help if you indicated which version of Elasticsearch you are using.

Hi Christian,
Please find answers to your questions below.

  • What kind of data is it?
It is reporting data, so it contains fields like {activityId, fileID, owner, activityTime, etc.}

  • How long do you keep the data?
We keep the data for 8 months.

  • How often do you update the data?
Quite frequently.

  • How do you update the data? Are you replacing with a new record or just changing a few fields?
We are just changing a few fields: 4 out of nearly 20 fields in total.

  • Do you know the timestamp associated with the initial event when you perform the update?
We know the timestamp of the initial (to-be-updated) activity, which is the activityTime field mentioned in the answer to question 1 above. This time indicates when the activity was performed.

  • It would also help if you indicated which version of Elasticsearch you are using.
    Certainly, we are using Elasticsearch version 8.6.0.

In that case I would recommend using traditional time-based indices where the date is part of the index name, sending data to the correct index based on the known timestamp. Depending on data volumes you may want to use daily or monthly indices with a reasonable number of primary shards (aim for a shard size of a few tens of GB).
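To illustrate, here is a minimal sketch with the Python client; the "activities-" prefix, the monthly granularity, and the field names are assumptions based on the mapping described above:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Illustrative connection details; replace with your own cluster and auth.
es = Elasticsearch("https://localhost:9200", api_key="...")

def monthly_index(activity_time: datetime) -> str:
    """Derive the index name from the event's own timestamp."""
    return f"activities-{activity_time:%Y.%m}"

activity_time = datetime(2023, 1, 15, tzinfo=timezone.utc)
index = monthly_index(activity_time)

# Write the document into the index that matches its timestamp...
es.index(index=index, id="a-123", document={
    "activityId": "a-123",
    "activityTime": activity_time.isoformat(),
    "owner": "alice",
})

# ...and later, because the known activityTime identifies the index, update
# just the changed fields with a direct partial update.
es.update(index=index, id="a-123", doc={"owner": "bob"})
```

With the timestamp known at update time, each update targets a single document in a single index, which avoids scanning the whole data set the way update_by_query does.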

Hey Christian,
Thanks for your help. I feel the solution you suggested would work for me. While implementing your suggested approach, however, I am facing certain challenges in Logstash.
Could you shed some light on:

As the discussion is now centered around how to handle this in Logstash, let's continue the discussion in the other thread.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.