Avoid 'indices hell' by copying and deleting data

Hi guys,

I've been working with Elasticsearch for a while and I'm dealing with the following:

  • I'm creating indices along three dimensions: language, document type and time (a new index per day)
  • I use aliases for grouping indices based on data 'freshness'

Each day brings around 10,000 incoming documents, but the volume will grow and we have to be prepared for hundreds of thousands of documents per day.

The data is full-text searchable and 90% of searches query data from the last 48 hours.

Now consider the following:

  • my last 48 hours of data has an alias (let's call it F - fresh)
  • my last 30 days of data has another alias (let's call it M - middle)
  • my older data has another alias (let's call it O - old)
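
For concreteness, this is roughly how I imagine juggling the aliases (a minimal sketch using the Python elasticsearch client with 7.x-style body= calls; the docs-<lang>-<type>-<date> naming is made up, not our real setup):

```python
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

today = date.today()
new_daily = f"docs-en-news-{today:%Y.%m.%d}"  # today's index joins F
expired = f"docs-en-news-{today - timedelta(days=2):%Y.%m.%d}"  # aged past 48h

# One atomic alias swap: today's daily index becomes "fresh", and the
# index that just aged past 48 hours moves from F to M in the same request.
es.indices.update_aliases(body={
    "actions": [
        {"add":    {"index": new_daily, "alias": "F"}},
        {"remove": {"index": expired,   "alias": "F"}},
        {"add":    {"index": expired,   "alias": "M"}},
    ]
})
```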

If I just create indices day by day without managing them (beyond aliases), after a year I would have 365 * number_of_lang * number_of_type indices, which I guess is too many, especially in terms of shard count.

So to avoid this I'd create the following structure, for one language and one type:

  • fresh data is in as many physical indices as necessary under alias F
  • middle data is in one physical index under alias M
  • old data is grouped into yearly indices, which are under alias O

Moving data would be a scheduled job which does the following (a sketch of the F-->M step follows the list):

  • F-->M: realias the relevant indices from F to M
    --- move the data from the daily indices in M into the single physical index (by copying and then deleting the documents)
    --- delete the now-empty daily indices from M
  • M-->O: the same logic, except that I move the data from M's physical index into the appropriate yearly index
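
For the copy-and-delete part of F-->M, I imagine something like this (a rough sketch with the same Python client; the single physical index name m-docs is made up, and _reindex needs Elasticsearch 2.3 or later - on older versions it would be a scan/scroll plus bulk copy instead):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def consolidate_m(physical_index="m-docs"):
    # Every index currently behind the M alias except the big physical one
    # is a freshly realiased daily index waiting to be merged in.
    daily_indices = [
        name for name in es.indices.get_alias(name="M")
        if name != physical_index
    ]
    for index in daily_indices:
        # Copy the documents into the single physical index...
        es.reindex(body={
            "source": {"index": index},
            "dest": {"index": physical_index},
        }, wait_for_completion=True)
        # ...then delete the now-empty daily index; deleting the index
        # also removes it from the M alias.
        es.indices.delete(index=index)
```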

It would be a daily scheduled job, but I'm not sure it would actually work out (how it would perform, whether it's worth it, etc.).
What do you think about this?

It seems like you'll have issues with the migrated data flashing in and out of visibility. You are right to be concerned about 365 * lang * type indexes. A couple of things:

  • Maybe combine the language indices; a single index can hold documents in several languages.
  • Maybe use weekly indices, or 10 days per index, or something like that. The length of time in a single index is the "extra" that you store past the end of your year. You can enforce visibility on F and M using range queries (see the sketch after this list). That should be reasonably quick.
  • Combine your types into one index if possible. You have to be careful: mixing different kinds of documents makes the index sparser, which isn't good for some data structures like doc values, so it might be a bad idea if the document types are quite different.
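
For the range-query part, something like this (a sketch; the @timestamp field name and the query are assumptions about your mapping):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Search through the F alias, but only let the true last-48-hours window
# match, even if the underlying weekly/10-day indices hold older data too.
resp = es.search(index="F", body={
    "query": {
        "bool": {
            "must":   [{"match": {"body": "whatever the user typed"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-48h"}}}],
        }
    }
})
print(resp["hits"]["total"])
```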

Thanks for your reply!

  • Combining the language indices may be a good approach, since we have to support 10 different languages (a limited number) and we can legitimately expect 90% of the documents to be in one language.
  • I wouldn't use weekly indices, since 90 percent of queries hit the data from the last two days. Why should I keep 5 days of data around unnecessarily? It wouldn't slow down the system, would it?
  • Combining types doesn't seem like a good approach either, since we'll have at least 4 different types, and we know the distribution between types is 2/3, 1/6, 1/6, with the fourth type unknown.

Some, yeah. Having a zillion indexes isn't good for the system either, and I'm trying to enumerate tradeoffs that might help you end up with fewer of them.

It wouldn't slow it down that much either - adding the range query slows down searches a bit, but because you are adding data in time order, locality will continue to work well.

About combining types: the document counts don't make a difference if the types are similar, or if you can make them similar by renaming fields (see the mapping sketch below). If you can't, then keeping each one separate makes sense.
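
To make "similar" concrete, a hypothetical combined mapping - shared fields keep one name, and a keyword field tells the types apart so searches can filter on it cheaply (7.x-style typeless mapping; all the names here are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index for all document types. "title" might be renamed from a
# per-type field like "headline" at index time; doc_type lets a search
# filter to one type with a cheap keyword clause.
es.indices.create(index="docs-combined", body={
    "mappings": {
        "properties": {
            "doc_type":   {"type": "keyword"},  # e.g. "article", "comment"
            "title":      {"type": "text"},
            "body":       {"type": "text"},
            "@timestamp": {"type": "date"},
        }
    }
})
```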

If I take your reply to its conclusion, it turns out that there is no performance difference between having indices separated by type and language and putting all types and languages into one index (which would just have more primary shards). Am I right?

The whole game is about increasing search performance; that's why I'm asking.

Not no performance difference, just not enough of one to justify so many indexes. That many counts as "a lot" and, while it mostly works, things generally work better when you don't have that many indexes.

Regarding separating by language - it depends. I can imagine it being faster not to separate them if, for example, there are lots of common fields and the language-dependent fields aren't that large.
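
For example, the metadata could be shared and only the analyzed text split per language, roughly like this (a sketch, again 7.x-style; the analyzers and field names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One index for all languages: shared metadata fields, plus one full-text
# field per language with a matching analyzer. Each document only fills
# the body_* field for its own language.
es.indices.create(index="docs-all-langs", body={
    "mappings": {
        "properties": {
            "language":   {"type": "keyword"},
            "@timestamp": {"type": "date"},
            "body_en":    {"type": "text", "analyzer": "english"},
            "body_de":    {"type": "text", "analyzer": "german"},
            # ...one body_* field per supported language
        }
    }
})
```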