Would there be an impact / difference between big and small indices?

Hello and good day!

Just like the title says, I humbly request you guys' expertise and a healthy, bountiful discussion on this...

My team has 2 indices called "web" and "socialmedia"; both get their data / documents from a MySQL database that has collected millions of records since 2015.

Now our IT head is planning to create micro indices that would chop these 2 BIG indices into smaller ones -- we'd be splitting each of them by month.

Here's the idea: for example, the "web" index holds a whole year of 2019 data / documents, and we're planning to chunk that data by month, from the "web" index into "webjan2019", "webfeb2019", "webmar2019", and so on (meaning all documents whose "time" falls in a specific month would belong to that month's micro index).
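The month-routing idea above can be sketched in Python. The hyphenated name `web-2019.01` used here is just the common Elasticsearch date-suffix convention; a `webjan2019`-style scheme would work the same way, and all names are illustrative:

```python
from datetime import datetime

def monthly_index(base: str, ts: datetime) -> str:
    """Pick the monthly index a document belongs to, based on its "time" field."""
    return f"{base}-{ts:%Y.%m}"

# A document with a January 2019 timestamp lands in "web-2019.01":
print(monthly_index("web", datetime(2019, 1, 15)))  # -> web-2019.01
```

The same function would run at ingest time (or in the MySQL-to-Elasticsearch sync job) to decide which index each row is written to.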

Now back to the main questions:

  1. Would there be a big impact or improvement if we implement the idea above?
  2. Would there be heavy drawbacks to doing so?

Thank you very much in advance, I greatly appreciate your invaluable expertise!

We aren't all guys :slight_smile:

Time-based indices 100% make sense for time-based data, rather than a single monolithic index. You should look at using ILM to manage them as well.
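For reference, here is a sketch of what an ILM policy for such indices could look like (Kibana Dev Tools syntax; the policy name and thresholds are placeholders, not recommendations for your cluster):

```
PUT _ilm/policy/monthly-web-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Note that rollover additionally needs an index template and a write alias pointing at the newest index; with ILM doing the rollover, you don't have to hand-create each monthly index yourself.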

There is a slight overhead with managing more shards, but that should be mitigated by more efficient searches.

The best advice is to keep shard size under 50GB. If that means weekly, monthly, or even yearly indices, that's fine.

+1 to all points Mark mentioned.

Aligning indices with your data expiration period can simplify cleanup: instead of deleting documents from an index, you can drop whole indices.
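That cleanup step can be sketched as follows: with per-month indices, retention becomes a matter of inspecting index names and deleting whole indices, instead of running an expensive delete-by-query. A minimal sketch, assuming a hypothetical `web-YYYY.MM` naming scheme (a date-suffix variant of the `webjan2019` idea above):

```python
from datetime import datetime

def expired_indices(names, now, keep_months=12):
    """Return monthly indices (suffix YYYY.MM) older than the retention window.
    Dropping these whole indices is far cheaper than deleting documents."""
    cutoff = now.year * 12 + now.month - keep_months
    expired = []
    for name in names:
        _, _, suffix = name.rpartition("-")
        try:
            year, month = suffix.split(".")
            if int(year) * 12 + int(month) <= cutoff:
                expired.append(name)
        except ValueError:
            continue  # skip names that don't follow the monthly convention
    return expired

# With 12-month retention as of March 2021, only the 2019 index has expired:
print(expired_indices(["web-2019.01", "web-2020.06", "web-2021.01"],
                      datetime(2021, 3, 1)))  # -> ['web-2019.01']
```

Each name this returns would then be deleted with a single `DELETE <index>` request (or, more simply, by the delete phase of an ILM policy).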

But overall, what is the point, i.e. is there a problem to be solved? Millions of docs is not very large and can easily be purged as needed. Elasticsearch also has no problem searching across all the docs from 2019, so unless you have performance or other issues, why break something that's working? Worst case, split into yearly indices if you really want to purge.

Hello sir, it's troublesome, especially if you need to reindex a lot of documents.

For example, if you want to change the mapping of a certain field, or add a new field with analyzers, you have to reindex roughly 300 million+ documents just to be able to use the index again.

Please note that the 2019 documents are only an example; we actually have around 300m+ docs accumulated since 2015.
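For readers following along: a mapping change on an existing field does indeed require creating a new index with the new mapping and copying everything across with the Reindex API, which is why index size matters so much here. A sketch in Kibana Dev Tools syntax (the index, field, and analyzer names are placeholders):

```
PUT web-v2
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" }
    }
  }
}

POST _reindex
{
  "source": { "index": "web" },
  "dest":   { "index": "web-v2" }
}
```

With monthly indices, that `_reindex` only has to copy the months whose mapping actually changed, rather than every document since 2015.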

Sure, but wouldn't you do that for all docs, or only the latest ones? Reindexing is not very common for most people. Though if you do it for only some docs, and fairly often, then sure: split them up and put an alias over the whole set so you don't have to change your code.
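The alias suggestion can be sketched like this (Dev Tools syntax; the alias name is a placeholder, and note an alias cannot share a name with an existing index, so the old monolithic "web" index would need to be gone or renamed first):

```
POST _aliases
{
  "actions": [
    { "add": { "index": "webjan2019", "alias": "web-all" } },
    { "add": { "index": "webfeb2019", "alias": "web-all" } }
  ]
}
```

Searches against `web-all` then span every monthly index behind it, so application code keeps querying a single name no matter how the data is split or reindexed underneath.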

Our indices are always changing depending on what the client or the boss needs, so we're always reindexing whenever something changes... so yeah, it's hard to reindex one BIG index with data going back to 2015...

So with your great expertise, sir, would our concept be a good optimization or improvement of our infrastructure?

Well, I guess if you are always changing them, you'll need smaller indexes and can break them up on that basis. Though instead of reindexing everything, I wonder if you could just index into smaller temporary indexes with the extra fields or analyzers you need; it depends a lot on how you use the data. Constantly changing index mappings is not very common.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.