Elastic Data Stream Creation And Search

I want to insert about 3 billion records into Elasticsearch.
In this migration, the data in each backing index is not sorted by time.
But when I search the data, my queries are time-based.
As far as I understand, when I search, Elasticsearch queries all backing indices, compares the results from all of them, and finally returns the result.
My first question is: am I right about this concept (searching all backing indices)?
Or is there any way to limit a search on the data stream so it does not hit all backing indices?

I have an idea.
I want to store all data in the backing indices according to time. The data goes back 7 years. I want to create a backing index for each month and store each document only in the index for its month.
Then at search time, detect the month of the request and search the corresponding backing index directly, instead of searching the data stream and hitting all backing indices.

To achieve this, I need to manually create all the backing indices and give them the desired names.
For example:
my-data-stream-2015-01-01
my-data-stream-2015-01-02
my-data-stream-2015-01-03
.
.
.
my-data-stream-2022-01-09

Is there any way to create a backing index manually?
Or
Is my idea of detecting the corresponding backing index and searching it directly a good one?

If I have misunderstood how data streams store data or how searches work, please let me know.
Thanks

Welcome to our community! :smiley:

It searches all indices that could hold data matching your filters, so it won't search everything unless you aren't filtering at all.

So you shouldn't need to worry about the naming structure of the indices behind a data stream.
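For example, a search with a time range filter only needs to touch the backing indices whose data can fall inside that range. A minimal sketch, assuming the 8.x Python client, a data stream named my-data-stream, and documents with an @timestamp field:

```python
from elasticsearch import Elasticsearch

# Hypothetical connection details; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

# The range filter on @timestamp lets Elasticsearch skip backing indices
# that cannot contain matching documents.
resp = es.search(
    index="my-data-stream",
    query={
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "2021-01-01", "lt": "2021-02-01"}}}
            ]
        }
    },
    size=100,
)

print(resp["hits"]["total"])
```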

Data streams rely on rollover and assume data arrives in order, often in near real time, so that data for a specific time period ends up colocated in indices. This allows Elasticsearch to efficiently exclude data at query time if you use a time filter, as indices that do not hold any data matching the query can be skipped quickly and cheaply. This approach also offers very efficient indexing, as bulk indexing requests are sent to a single index at a time.

If, with this approach, you send in unsorted data covering a long time period, you benefit from the indexing speed, but as the data will not be colocated by timestamp you will need to query most indices at query time, which adds load.

The other approach available to you is to index into time-based indices where the time period covered by each index is included in the name, e.g. monthly or daily indices. This is what was used before rollover was introduced, and it works well. Here you determine which index a document belongs to based on its timestamp, so data ends up colocated by timestamp, which helps at query time.
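For illustration, routing each document to a monthly index based on its timestamp could look something like this. This is only a sketch, assuming the 8.x Python client; the index naming scheme (my-data-YYYY-MM) and the @timestamp field are examples, not anything built into Elasticsearch:

```python
from elasticsearch import Elasticsearch, helpers

# Hypothetical connection details; adjust for your cluster.
es = Elasticsearch("http://localhost:9200")

def monthly_index(doc):
    # Derive the target index name from the document's timestamp,
    # e.g. "2015-01-03T10:00:00Z" -> "my-data-2015-01".
    return "my-data-" + doc["@timestamp"][:7]

def actions(docs):
    for doc in docs:
        yield {"_index": monthly_index(doc), "_source": doc}

docs = [
    {"@timestamp": "2015-01-03T10:00:00Z", "value": 1},
    {"@timestamp": "2021-06-15T12:30:00Z", "value": 2},
]

helpers.bulk(es, actions(docs))
```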

If your data is not sorted, however, you will end up with bulk indexing requests potentially targeting a large number of different indices. This means that for each bulk request only one or a handful of documents will go to the same index, which slows down indexing significantly as it is a lot less efficient.

As you can see, you need to trade indexing efficiency against querying efficiency. As you only index once but query frequently, I would probably recommend the second option and avoid using data streams. Even if you cannot sort the full data set, you might be able to improve indexing efficiency by sorting subsets of the data (reasonably large ones, corresponding to many bulk requests) before indexing them into Elasticsearch.
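To illustrate that last point, sorting each batch by timestamp before sending it means consecutive documents mostly target the same monthly index. Again just a sketch with hypothetical names, assuming the 8.x Python client:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_batch(batch):
    # Sort the batch by timestamp so that consecutive documents mostly
    # target the same monthly index, keeping each bulk request efficient.
    batch.sort(key=lambda doc: doc["@timestamp"])
    helpers.bulk(
        es,
        (
            {"_index": "my-data-" + doc["@timestamp"][:7], "_source": doc}
            for doc in batch
        ),
    )
```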


Thank you very much for your reply

Based on your reply and @warkolm's reply, I have decided to store all data in the index corresponding to each document's month. This is possible during the migration process; even if it costs some extra work, it is only a one-time job.
After creating all the indices and finishing the migration, I want to put all the indices behind an alias and finally convert the alias to a data stream.
With the above approach I can implement both plans at search time.
As the first plan, I just search the corresponding index based on the index time.
As the second plan, I search the data stream directly.
Also, I have decided to store the data sorted by time within each segment.
Finally, I want to compare both plans and choose the best one.
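A rough sketch of that plan, with the same assumptions as above (8.x Python client, hypothetical index names like my-data-2021-06 and an alias named my-data; whether the migrate-to-data-stream call is available, and its prerequisites such as a matching index template and a write index on the alias, depends on your Elasticsearch version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Put all monthly indices behind one alias so the data can later be
# queried (and migrated) as a single unit.
es.indices.put_alias(index="my-data-*", name="my-data")

# Convert the alias into a data stream (check the prerequisites in the
# docs for your version before relying on this).
es.indices.migrate_to_data_stream(name="my-data")

time_filter = {"range": {"@timestamp": {"gte": "2021-06-01", "lt": "2021-07-01"}}}

# Plan 1: search only the monthly index for the requested month.
plan1 = es.search(index="my-data-2021-06", query=time_filter)

# Plan 2: search the data stream and let the time filter prune
# backing indices that cannot match.
plan2 = es.search(index="my-data", query=time_filter)

print(plan1["took"], plan2["took"])
```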

Another important point is that after the migration is complete, the new data will be inserted according to the correct time and order.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.