Indexing, querying and bulk updating against time-based indexes

We have a system that collects, processes, and stores customer-generated data. The data comes from multiple streams with varying cadences: some are near real time, others delayed 24 to 48 hours. We are currently storing this in one large index, but it is becoming unwieldy, so I am looking into breaking it into monthly time-based indexes. I am familiar with the documentation on rolling over indexes and using aliases, and that part is straightforward. That scenario seems geared to logging, however, and our situation is a little different.

First, what we care about is when the originating document was created by the customer (i.e. not when the processed document was indexed into ES). Our system is keyed on created_date, and that is what we know deterministically. If the created_date and the index date fall on opposite sides of an index rollover, it is not as straightforward to grab that item. Instead of a direct link to the doc, we would have to manage aliases for it (i.e. look at the dates, determine whether it was close to a rollover date, and possibly include multiple indexes in the query).
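A minimal sketch of that fan-out logic, assuming monthly index names like `data-YYYY.MM` (the prefix and the two-day slack window are hypothetical, not from this post): given a created_date, return the index to search, plus the neighbouring month's index when the date is close enough to a boundary that late-arriving data might have landed on the other side.

```python
from datetime import date, timedelta

def candidate_indexes(created, slack_days=2, prefix="data-"):
    """Monthly index names to search for a doc created on `created`.

    If the creation date falls within `slack_days` of a month boundary,
    the neighbouring month's index is included too, since delayed data
    may have been indexed after the rollover.
    """
    def name(d):
        # e.g. date(2021, 5, 17) -> "data-2021.05"
        return f"{prefix}{d.year}.{d.month:02d}"

    indexes = {name(created)}
    indexes.add(name(created - timedelta(days=slack_days)))
    indexes.add(name(created + timedelta(days=slack_days)))
    return sorted(indexes)

# Near a boundary, two indexes come back; mid-month, just one.
candidate_indexes(date(2021, 5, 1))   # ["data-2021.04", "data-2021.05"]
candidate_indexes(date(2021, 5, 15))  # ["data-2021.05"]
```

The resulting list can be joined with commas into the index part of a search URL, since Elasticsearch accepts multiple index names per request.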

Also, there are times when we bulk update and have only a list of ids. In this scenario (i.e. the documents corresponding to those ids may be spread across monthly indexes covering up to 36 months), how can we bulk update? Can we just query against _all?
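One way to avoid resolving each id to an index is to target a wildcard pattern (e.g. `data-*`, a hypothetical prefix) with `_update_by_query` and an `ids` query. A sketch that just builds the request bodies, batching the ids to keep each request small:

```python
def build_update_by_query(ids, script_source, batch_size=1000):
    """Yield _update_by_query bodies, one per batch of ids.

    Each body would be POSTed to /data-*/_update_by_query, letting
    Elasticsearch fan the search out across the monthly indexes.
    """
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        yield {
            "query": {"ids": {"values": batch}},
            "script": {"source": script_source, "lang": "painless"},
        }

bodies = list(build_update_by_query(
    ["a", "b", "c"], "ctx._source.flag = true", batch_size=2))
# bodies[0] covers ids ["a", "b"], bodies[1] covers ["c"]
```

This works without knowing which index holds which id, but every request still searches all 36 indexes, so it is less efficient than routing each update directly to its index.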

Is there a better way I should be approaching this?


The rollover API is great for immutable data, but it makes it hard to efficiently determine the index for update operations. If you have the created_date as part of the document id, you are probably better off creating time-based indices with the year and month in the index name. This lets you derive the index name from the document id, which will simplify bulk updates.
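A sketch of that derivation, assuming a hypothetical id format that embeds the creation date, e.g. `"20210517-abc123"`, and monthly indexes named `data-YYYY.MM` (both assumptions, not from the thread). Each update action in a `_bulk` request can then be routed straight to its index:

```python
from datetime import datetime

def index_for_id(doc_id, prefix="data-"):
    """Derive the monthly index name from an id like '20210517-abc123'."""
    date_part = doc_id.split("-", 1)[0]
    created = datetime.strptime(date_part, "%Y%m%d")
    return f"{prefix}{created.year}.{created.month:02d}"

def bulk_update_actions(updates):
    """Build the action/source pairs for a _bulk request.

    `updates` is a list of (doc_id, partial_doc) tuples; each pair is
    routed to the index derived from its id, so one _bulk request can
    span many monthly indexes.
    """
    lines = []
    for doc_id, partial_doc in updates:
        lines.append({"update": {"_index": index_for_id(doc_id), "_id": doc_id}})
        lines.append({"doc": partial_doc})
    return lines

actions = bulk_update_actions([("20210517-abc123", {"flag": True})])
# actions[0] routes the update to "data-2021.05"
```

Each dict would be serialized as one line of newline-delimited JSON in the `_bulk` body, so no per-document search is needed to locate the index.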

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.