We have a system that collects, processes, and stores customer-generated data. The data comes from multiple streams with varying cadences: some are near real time, and others are delayed by 24 to 48 hours. We currently store all of it in one large index, but that is becoming unwieldy, so I am looking into breaking it into time-based monthly indexes. I am familiar with the documentation on rolling over indexes and using aliases, and that part is very straightforward. That scenario seems geared toward logging, however, and our situation is a little different.
First, what we care about is when the originating document was created by the customer (i.e. not when the processed document was indexed into ES). Our system is based on create_date, and that is what we know deterministically. If the created_date and the index date fall on opposite sides of an index rollover, it is not as straightforward to grab that item. Instead of a direct link to the doc, we would have to manage aliases for it (i.e. look at the dates, determine whether the doc was possibly close to a rollover date, and include multiple indexes).
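To make the lookup problem concrete, here is a minimal Python sketch of what I mean. It assumes indexes are named by the month in which a document was indexed, and that ingestion lags creation by at most 48 hours. The prefix `customer-docs`, the `YYYY.MM` suffix, and both function names are hypothetical, not anything from our system or from ES itself.

```python
from datetime import datetime, timedelta

INDEX_PREFIX = "customer-docs"   # hypothetical index name prefix
MAX_DELAY = timedelta(hours=48)  # assumed worst-case ingestion lag


def monthly_index(ts: datetime) -> str:
    """Name of the monthly index that covers the given timestamp."""
    return f"{INDEX_PREFIX}-{ts:%Y.%m}"


def candidate_indexes(created: datetime) -> list[str]:
    """Indexes a doc could live in when all we know is its create date.

    The doc was indexed somewhere between `created` and
    `created + MAX_DELAY`, so near a month boundary two indexes qualify.
    """
    earliest = monthly_index(created)
    latest = monthly_index(created + MAX_DELAY)
    return [earliest] if earliest == latest else [earliest, latest]
```

A doc created mid-month resolves to a single index, but one created on the last day of a month has to be looked up in both that month's index and the next one.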
Also, there are times when we bulk update and have only a list of ids. In this scenario the documents corresponding to those ids may be spread across monthly indexes covering up to 36 months. How can we bulk edit them? Can we just use _all?
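The only approach I have come up with so far is a two-step one, sketched below in Python. My understanding (which may be wrong) is that each bulk update action has to name one concrete index, so the sketch first builds an `ids` query to run against a wildcard pattern such as `customer-docs-*` (a hypothetical name), then uses the `_index` of each hit to build the bulk update actions. The function names are made up for illustration, and the code only builds request bodies; it does not talk to a cluster.

```python
def ids_query(ids: list[str]) -> dict:
    """Search body that finds the given ids across whatever indexes are searched."""
    return {
        "query": {"ids": {"values": ids}},
        "_source": False,   # only _index and _id are needed from each hit
        "size": len(ids),   # assumes the list fits within the search result window
    }


def bulk_update_actions(search_response: dict, partial_doc: dict) -> list[dict]:
    """Turn search hits into action/payload pairs for the _bulk endpoint."""
    actions = []
    for hit in search_response["hits"]["hits"]:
        # Each update is addressed to the concrete index where the hit was found.
        actions.append({"update": {"_index": hit["_index"], "_id": hit["_id"]}})
        actions.append({"doc": partial_doc})
    return actions
```

This costs an extra search round trip per batch, which is part of why I am asking whether there is a more direct way.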
Is there a better way I should be approaching this?
Thanks!
~john