Move data between data stream backing indexes

We collect telemetry from our systems and store each event in a service bus, ready for ingestion. We have also set up a data stream for ingesting telemetry events, with a corresponding ILM policy with automatic rollover, so that a new backing index is created at least once a day. Events are read from the service bus and indexed into the data stream.

Suppose a problem interrupts telemetry ingestion to the data stream. Ingestion is later resumed, but the timestamps of the events no longer match the date of the backing index: events from 2022-08-21 end up indexed in a backing index .ds-telemetries-2022.08.24*, instead of .ds-telemetries-2022.08.21*. Is it at all possible to move existing data from one backing index to another?

The reason for the question is that we want to be able to query specific backing indexes based on the creation date of the backing index, rather than the timestamp of the event, and be able to assume that all telemetry events for a given date can be found in the corresponding backing index. A query like the following would then hit all telemetry events for a specified month.

GET /.ds-telemetries-2022.08*/_search
{
  "size": 20,
  "seq_no_primary_term": true,
  "query": {
    "match_all": { }
  }
}

You can't move documents between indices. What you can do is reindex the documents from the backing index, using a query in the _reindex request to filter for the documents that need to be moved.

After the reindex you will need to manually delete the documents from the source index; you could use a _delete_by_query with the same query used in the reindex.
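As a sketch of those two steps, assuming the events carry an @timestamp field (the index names and date range here are made up for illustration):

```
POST /_reindex
{
  "source": {
    "index": ".ds-telemetries-2022.08.24-000004",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2022-08-21T00:00:00Z",
          "lt": "2022-08-22T00:00:00Z"
        }
      }
    }
  },
  "dest": {
    "index": ".ds-telemetries-2022.08.21-000001"
  }
}

POST /.ds-telemetries-2022.08.24-000004/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2022-08-21T00:00:00Z",
        "lt": "2022-08-22T00:00:00Z"
      }
    }
  }
}
```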

But depending on the number of events, this can be a lot of work.

If this happens frequently, I think you should not use data streams, but instead manage time-series indices without them, as described in the documentation.

I don't think this is a correct assumption. Elasticsearch timestamps are always stored in UTC, so unless you are in the UTC timezone, or only use dates and times in UTC, you will have documents from different days in the daily backing indices.

For example, I'm on UTC-03:00, so my daily indices are created at 21:00 local time: all the documents from 21:00:00 until 23:59:59 local time on August 31st will be in the September 1st index.

If on September 1st I query something-2022.08.* without a time range, I will also get documents from July and miss some documents from the last day of August. Whether this is a problem depends on the use case.

If you are in the UTC timezone, then you do not have this problem.
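One way to sidestep the day-boundary question entirely is to filter on the event timestamp instead of the index name. A sketch (the field name @timestamp and the index pattern are assumptions):

```
GET /.ds-telemetries-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2022-08-01T00:00:00Z",
        "lt": "2022-09-01T00:00:00Z"
      }
    }
  }
}
```

This returns exactly the events for August regardless of which backing index they landed in.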


Hi, thanks for your reply!

Good foresight on the UTC date issue. Fortunately, our telemetry timestamps are in UTC.

Unfortunately, your suggestion to reindex the data didn't seem to work. It seems that targeting data stream backing indices with a reindex isn't allowed.

The query

POST /_reindex
{
  "source": {
    "index": ".ds-telemetries-2022.08.23-000003"
  },
  "dest": {
    "index": ".ds-telemetries-2022.08.24-000004",
    "op_type": "create"
  }
}

resulted in

...
{
    "failures": [
        {
            "index": ".ds-telemetries-2022.08.24-000004",
            "type": "_doc",
            "id": "yIT0zoIBh_JGBgavK4BG",
            "cause": {
                "type": "illegal_argument_exception",
                "reason": "index request with op_type=create targeting backing indices is disallowed, target corresponding data stream [telemetries] instead"
            },
            "status": 400
        }
    ],
...
}

We also don't want to target the data stream directly, as that would only reindex the data into the current write index, not our intended target index.

To be fair, the documentation for data streams explicitly states that writing to a backing index other than the current write index is not allowed.
Data streams | Elasticsearch Guide [8.3] | Elastic

You cannot add new documents to other backing indices, even by sending requests directly to the index.

Despite this, do you, or someone else, know of another workaround?

As a side note, we are also considering dropping data streams in favor of index aliases, which aren't as restrictive.
Tutorial: Automate rollover with ILM | Elasticsearch Guide [8.3] | Elastic

We are however worried that this might impact query performance later.

Oh, I do not use data streams and didn't catch that in the documentation.

If you are not allowed to write to backing indices even by sending requests directly to them, I don't think there is any workaround; you would need to stop using data streams and use normal indices with an alias.

I don't think there will be any difference at all, at least according to the documentation: a request to a data stream will query all of its backing indices, and the same thing happens when you send a request to an alias, which queries all indices with that alias.

The limitation of data streams seems to concern only where you can write.
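If you do switch to plain indices, the write side could be wired up roughly like this (index and alias names are hypothetical; with ILM rollover the alias is normally managed for you):

```
PUT /telemetries-2022.08.24-000001
{
  "aliases": {
    "telemetries": {
      "is_write_index": true
    }
  }
}

GET /telemetries/_search
{
  "query": {
    "match_all": {}
  }
}
```

Unlike a data stream, nothing here stops you from reindexing old events directly into an older telemetries-* index.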

Welcome to our community! :smiley:

Is that really an issue, though? You should talk to the data stream itself when querying, and Elasticsearch will handle the query efficiently.

Yeah, that's what we feared. However, if you don't think there are any performance drawbacks to using an index alias instead, then that would be the solution to our problem.

Anyway, thanks for your help @leandrojmp :slight_smile:

@warkolm

Thanks! :smile:

Say that we want to query telemetry for a specific month. Wouldn't it be much more performant to narrow down which backing indices we hit by using a query like

GET /.ds-telemetries-2022.08.*/_search
{
...
}

instead of querying all the backing indices by querying the data stream directly? And we cannot do that unless the telemetry is indexed in the "correct" backing index.

Or do you think that's unnecessary? :thinking:
We haven't used data streams before, so we might be unaware of how data stream queries work under the hood.

Using time-based indices that cover specific periods through their names, and querying through an alias, means that data will be colocated by time. Indexing old data may be more expensive as you target more shards, but you pay that price only once. When you query a specific time period, you know that many shards will hold no relevant data, which will generally speed up queries.

In old versions of Elasticsearch, querying all indices used to be expensive, so Kibana used ways to determine the correct indices before running the query against just those. Initially this was done by calculating index names from the timestamp range; later it was replaced with a call to an API that provided the timestamp range for each index (which removed issues with data being in the wrong index). In more recent versions Elasticsearch has improved a lot and this is no longer necessary: querying an index that holds no data within the timestamp range is now quite quick and efficient. I therefore do not think you need to worry about querying all backing indices through an alias.
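In other words, instead of computing backing index names for a month, the same month query can simply target the data stream with a timestamp range, letting Elasticsearch skip shards that hold no matching data. A sketch (field name @timestamp assumed):

```
GET /telemetries/_search
{
  "size": 20,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2022-08-01T00:00:00Z",
        "lt": "2022-09-01T00:00:00Z"
      }
    }
  }
}
```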

Alright, thanks for your input, that's really helpful.

Then it sounds like we shouldn't worry about any of this: just query the data stream directly and rely on Elasticsearch to optimize our queries.

