Historical (Past) Data Ingestion in Time Series Data Stream

Is it possible to insert historical (past) data into a Time Series Data Stream (TSDS) in Elasticsearch? To be clearer: suppose I want to ingest NYC Taxi Trip Data from 2009 to 2024 into a time-series data stream. Is that possible? I tried but couldn't do it.

If it's possible at all, how to do it?
If it's not possible, what can be the best possible workaround with Elasticsearch?
Someone please guide ...

It would help if you provided information about exactly what you did try, what errors you received and why you say it did not work.

I faced the following error:
the document timestamp is outside of ranges of currently writable indices
and couldn't solve it, because index.time_series.start_time and index.time_series.end_time can't be set dynamically.

I do not use TSDS, but looking at the documentation it seems that this is not possible.

It says that data must be in the accepted time range and also this:

If no backing index can accept a document’s @timestamp value, Elasticsearch rejects the document.
Elasticsearch automatically configures index.time_series.start_time and index.time_series.end_time settings as part of the index creation and rollover process.

There is a setting named index.look_back_time that allows accepting documents from the past, but the maximum supported interval is 7 days. [documentation]
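For reference, index.look_back_time is set in the index template that backs the TSDS. A minimal sketch (index, dimension, and metric names are made up for illustration, not from this thread):

```
PUT _index_template/taxi-tsds-template
{
  "index_patterns": ["taxi-tsds*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.look_back_time": "7d",
      "index.routing_path": ["vendor_id"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "vendor_id": { "type": "keyword", "time_series_dimension": true },
        "fare_amount": { "type": "double", "time_series_metric": "gauge" }
      }
    }
  }
}
```

Even at the 7d maximum, this only widens the accepted window slightly; it won't help with data going back to 2009.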


@leandrojmp ,
Thanks for your reply. So my best option is to use standard indices with some optimization techniques? Or is there a better way? What's your suggestion for handling large-scale (high-frequency) historical time-series data with Elasticsearch?

Heya @rubayetahmed314

Can you point me to the NYC Taxi dataset...

Interesting question....

Pretty sure you can use TSDS .... Not sure it is really needed...

I played with this data set in the past and never used tsds....

So my question is: are you just looking to play with the New York City taxi data, or are you trying to learn about TSDS?

It may not make sense to combine those two things.... Not saying you can't, but not sure it's the best use of time or learning.

But you did get my interest... 🙂
I might be able to look at this a little later

Pretty sure we can set the start time, because you have to do that if you ever reindex a TSDS:
Reindex a TSDS | Elastic Docs.
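Following the Reindex a TSDS docs, the trick is an index template that pins index.time_series.start_time and index.time_series.end_time explicitly, so the first backing index covers the whole historical range. A rough sketch (template name, pattern, routing field, and dates are illustrative):

```
PUT _index_template/taxi-history-template
{
  "index_patterns": ["taxi-history*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.time_series.start_time": "2009-01-01T00:00:00Z",
      "index.time_series.end_time": "2025-01-01T00:00:00Z",
      "index.routing_path": ["vendor_id"]
    }
  }
}
```

Note that the docs recommend removing these overrides once the backfill is done, so rollover can manage the time range again.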


Here is the dataset

I actually need to handle historical (past) and ongoing (present) time-series data efficiently with Elasticsearch, with performance on par with, say, TimescaleDB or other time-series-specialized databases.

Then why not just use TimescaleDB directly? Because my company's current stack is heavily aligned with (dependent on) Elasticsearch.

Using TSDS is not mandatory in any way, but I found its architecture really close to that of other time-series-specialized databases. That's why I got interested in it, but then I could not ingest historical data (starting from 2009) and opened this topic.

I just need to handle time-series data (both past and present) with Elasticsearch in the best possible way. I mentioned the NYC Taxi Trip Dataset as a quick reference: if I can handle that terabyte-scale dataset seamlessly, it should be good for my use case as well.

tl;dr: This ^^^^ depends entirely on the details of your use case.

2nd: Elasticsearch's TSDS can work well for metrics-type data, but there are trade-offs. If your main concerns are storage efficiency or high-throughput metric aggregation at large scale, purpose-built time-series databases may perform better.

That said, Elasticsearch does offer flexibility — for example, it allows combining time-series data with full-text search or geo-spatial queries. And if you're already using Elasticsearch for other types of data, using it for time series can simplify your stack.... Seems like this is you ...

Technically, yes (under some specific constraints) — but it’s probably not a great fit

According to Elastic's documentation:

Only use a TSDS if you typically add metrics data to Elasticsearch in near real-time and in @timestamp order.

So I will back track a bit.....

If it’s just a one-time historical load, TSDS might work with some care. But if you plan to regularly index non-real-time or out-of-order data, TSDS is likely the wrong choice.

A regular data stream or manually managed time-based indices will probably be more flexible in that case. You might need to do some routing, etc...
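One way to do that routing without custom client code is the date_index_name ingest processor, which sends each document to a time-bucketed index derived from a timestamp field. A sketch, assuming a pickup_datetime field and a taxi-trips- index prefix (both names made up for illustration):

```
PUT _ingest/pipeline/route-by-year
{
  "description": "Route each document to a yearly index based on its timestamp",
  "processors": [
    {
      "date_index_name": {
        "field": "pickup_datetime",
        "index_name_prefix": "taxi-trips-",
        "date_rounding": "y",
        "index_name_format": "yyyy",
        "date_formats": ["ISO8601"]
      }
    }
  ]
}
```

Then index with ?pipeline=route-by-year (or set index.default_pipeline), and a 2012 trip lands in taxi-trips-2012 regardless of when it is ingested.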

Soooo if we are going to help steer..... we would probably need a little more detail about the Actual use case, like....

  • How much data (document count, size)?
  • What’s the ingest rate?
  • Are you doing an initial load plus ongoing ingestion, or just starting from scratch?
  • Is the data mostly current, or is it often delayed or historical? How far out of order?
  • Over what time range? Days, months, years? What distribution?
  • Do you know time ranges ahead of time?
  • Are you planning to delete or archive old data?
  • Will documents be updated after indexing?
  • What kind of queries or aggregations will you run?

My thoughts... others may have theirs....

All this said we probably won't be able to fully give you an answer (I doubt you expect one) but perhaps we might be able to keep you out of the trench 🙂


How much data (document count, size)?

The initial (historical) load is the last 3 years of data, one document every minute, so around 15.7 million docs.

What’s the ingest rate?

The current ingestion rate is one document every minute, but the client wants the flexibility to ingest every second, or to change the ingestion rate if they need to in the future.

Are you doing an initial load plus ongoing ingestion, or just starting from scratch?

Initial load plus ongoing ingestion.

Is the data mostly current, or is it often delayed or historical? How far out of order?

Ongoing data is mostly current, but the initial load is historical (the last 3 years).

Do you know time ranges ahead of time?

Not always, because the client will upload those past data files in small parts, and they want the flexibility to upload any part (non-sequentially) they like.

Are you planning to delete or archive old data?

Yes; downsampling especially is in the plan, but we'll also need to delete data after a certain period.

Will documents be updated after indexing?

The client wants that flexibility.

What kind of queries or aggregations will you run?

Mostly to get insights into a few particular metrics over a defined time range, grouped by dimensions.
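The shape of query I have in mind is roughly this (field and index names are just placeholders): a range filter over the time window, a date_histogram for the buckets, and a terms sub-aggregation per dimension:

```
POST taxi-trips-*/_search
{
  "size": 0,
  "query": {
    "range": { "pickup_datetime": { "gte": "2022-01-01", "lt": "2023-01-01" } }
  },
  "aggs": {
    "per_month": {
      "date_histogram": { "field": "pickup_datetime", "calendar_interval": "month" },
      "aggs": {
        "by_vendor": {
          "terms": { "field": "vendor_id" },
          "aggs": {
            "avg_fare": { "avg": { "field": "fare_amount" } }
          }
        }
      }
    }
  }
}
```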

All this said we probably won't be able to fully give you an answer (I doubt you expect one) but perhaps we might be able to keep you out of the trench 🙂

That's Ok. I just need a nudge to head in the right direction. That would be helpful.

Heya @rubayetahmed314

Just to double check ...

60 mins * 24 hours * 365 days * 3 years = 1.57M (not 15.7M) records.... so just checking... Assuming these are not gigantic docs, that is pretty small.... Or is that just 1 series of data and there are 1000s of these? Am I missing something?

Also a quick check... If you ingest 1 document / sec for a year, that is about 31.5M records / year... not so small... but not huge either...

So I think this all comes down to a simple tradeoff....

Single index: 1 index for easy management of ingest / search / updates. You could break it up into multiple shards and even get fancy with routing etc., etc. (I might be a little concerned about this if you plan to go to 1-sec ingest.)

Multiple indices: More than 1 index. A little more complex on ingest / updates, not much more complex on search / aggregations... A little more maintenance, a little more flexibility. The magic question is how many, and over what time frames. A simple way would be by year, but you could do all sorts of things...

BOTH: A good, clean, thoughtful mapping (schema) is key to all of this... which is related to: how big in bytes / how complex are the docs?

So what to do....

IF it is really 1.57M records, there is no immediate need to go to 1 sec, and your documents are not something astronomical, I think I would go with 1 index.... KISS principle.

You can always reindex data later into separate indices... but breaking up by time may not be trivial, but can be done.

IF you want to spend time up front and build some routing logic into the ingestion logic (or even into an ingest pipeline), you can... but it may not be needed... and you could probably add it later if they go to 1 second... or something changes dramatically.

There are other things in your requirements that could lead to breaking it up. You mention downsampling... with data at 1-min granularity and these small sizes, hmmm... Also, downsampling and inserting data out of order are not really compatible concepts, so I am not clear on that. You can build transforms (which is what downsampling is built on) to pre-aggregate data; you might need to re-run some of them if old historical data is added / updated, etc.
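To make the transform idea concrete, a transform can pivot raw documents into hourly pre-aggregated buckets; if historical data arrives late, the affected range would need to be re-processed. A sketch with made-up field and index names:

```
PUT _transform/taxi-hourly-rollup
{
  "source": { "index": "taxi-trips-*" },
  "dest": { "index": "taxi-trips-hourly" },
  "pivot": {
    "group_by": {
      "hour": { "date_histogram": { "field": "pickup_datetime", "fixed_interval": "1h" } },
      "vendor_id": { "terms": { "field": "vendor_id" } }
    },
    "aggregations": {
      "avg_fare": { "avg": { "field": "fare_amount" } }
    }
  }
}

POST _transform/taxi-hourly-rollup/_start
```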

Me... I would load your 1.57M records and see what it looks like.... For search speed, you can add replicas if needed...

Thoughts?

Thanks for the guidance. Sorry for that calculation mistake earlier.

I was thinking about a yearly index breakdown (for per-minute data) and using an alias for easier management. Is this a good idea?
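Something like this is what I have in mind (index names are just examples): yearly indices behind one alias, with the current year as the write index:

```
POST _aliases
{
  "actions": [
    { "add": { "index": "taxi-trips-2023", "alias": "taxi-trips" } },
    { "add": { "index": "taxi-trips-2024", "alias": "taxi-trips" } },
    { "add": { "index": "taxi-trips-2025", "alias": "taxi-trips", "is_write_index": true } }
  ]
}
```

Searches against taxi-trips would span all years; writes that don't name an explicit index would go to the is_write_index one.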

And I have one confusion: a TSDS somehow manages to skip shards based on the time range being queried. How can I achieve that same advantage with standard indices when I use a UUID for each document?

That was my first inclination.

Conflating a couple concepts here ... so lets tackle them separately.

TSDS skipping shards: Yes, it can do that because it knows the time range of the search and the min/max time for each backing index, which means shards. This is done because in systems at scale there may be 100s or 1000s of shards, and the optimization can have a significant positive performance impact.

For what you are talking about, a few shards, maybe in the 10s, I suspect trying to do that is an unneeded over-optimization, unless you are expecting extreme query volumes and complexity.

Yes, something can be done... custom document routing on ingest and search, etc...
Building some intelligence into your query layer to sub-select indices, etc... but it is unclear whether you even need that, and it could cause other complexity... what if you are searching across years, etc.?

I would do some testing first before worrying about this... Heck, if it is really 1.5M - 3.0M average-size docs... the whole data set may end up in memory... or if you need more throughput, replicas...

I think good mappings and good / properly sized hardware will have a bigger impact than trying low-level optimizations to start.

Pretty much unrelated.... UUIDs and skipping shards based on time range are not really related.

Providing your own document id (_id): there is often a preconceived idea that this is good practice or needed.... most of the time it is not.

Unless you plan to access / update documents using the PUT / UPDATE document APIs, it is probably useless... You can still store your UUID as a keyword field and then be able to access, search, and update documents by it.

IF you are going to update documents directly by _id, perhaps there is a use for it.

To be clear, using your UUID as the document _id is OK, but there can be a negative ingest-performance impact on very large datasets (which, at this time, your use case does not qualify as).
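To illustrate the keyword-field alternative (field and index names are made up): map the UUID as a keyword, let Elasticsearch auto-generate _id, and use update by query when you need to change a document:

```
PUT taxi-trips-2024
{
  "mappings": {
    "properties": {
      "trip_uuid": { "type": "keyword" },
      "fare_amount": { "type": "double" }
    }
  }
}

POST taxi-trips-2024/_update_by_query
{
  "query": { "term": { "trip_uuid": "your-uuid-here" } },
  "script": {
    "source": "ctx._source.fare_amount = params.fare",
    "params": { "fare": 12.5 }
  }
}
```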

I guess I am coming back to... nothing I have seen yet indicates spending a lot of time up front doing low-level optimizations... We can chat a little about routing documents on ingest and search... that will / can add complexity... very much unclear if that is needed.

This will really come down to your search and aggregation use cases... that will drive everything else.


Ok. I’m starting with a simple setup for now.
@stephenb, thanks a lot for giving your time.


You're welcome!... Come back with specifics... as they evolve.
