As the title says, I'd humbly like to ask for your expertise and a healthy, bountiful discussion on this....
My team has two indices, "web" and "socialmedia"; both get their data / documents from a MySQL database that has accumulated millions of records since 2015.
Now our IT Head is planning to create micro indices that would chop these two BIG indices into smaller ones, dissecting each of them per month.
Here's the idea: for example, the "web" index holds a whole year of 2019 data / documents, and we're planning to split that data by month, from the "web" index into "webjan2019", "webfeb2019", "webmar2019", and so on (meaning, every document whose "time" falls in a specific month would belong to that month's micro index).
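To make the routing concrete, here's a minimal sketch of the naming scheme described above, assuming documents carry an epoch-seconds "time" field (the function name and signature are hypothetical, just for illustration). Note that the more common Elasticsearch convention is a sortable suffix like `web-2019.02`, which also plays nicely with date math and wildcard patterns:

```python
from datetime import datetime, timezone

def monthly_index(base: str, ts: float) -> str:
    """Return the monthly micro index a document belongs to,
    e.g. monthly_index("web", <feb 2019 timestamp>) -> "webjan2019"-style name."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    # strftime('%b') gives the abbreviated month name ("Feb"), lowercased here
    return f"{base}{dt.strftime('%b').lower()}{dt.year}"

# A document timestamped 2019-02-15 routes to "webfeb2019"
print(monthly_index("web", datetime(2019, 2, 15, tzinfo=timezone.utc).timestamp()))
```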
Now back to the main questions:
Would there be a big impact or big improvement if we'll be performing the aforementioned idea?
Would there be heavy drawbacks by doing so?
Thank you very much in advance, I greatly appreciate your invaluable expertise!
But overall, what is the point, i.e. is there a problem to be solved? Millions of docs is not very large and is easily purged, etc. as needed. ES also has no problem searching all the docs from 2019, so unless you have performance or other issues, why break something that's working? Worst case, break it into years if you really want to purge, etc.
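For context on the purge point: with time-based indices, retiring old data is a single index deletion, whereas in one big index it means a delete-by-query that scans and soft-deletes millions of docs. A sketch in Kibana console syntax (the "time" field name is an assumption from the question above):

```
# With yearly/monthly indices, purging 2019 is one cheap call:
DELETE /web2019

# Versus delete-by-query against one big index:
POST /web/_delete_by_query
{
  "query": {
    "range": { "time": { "lt": "2020-01-01" } }
  }
}
```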
Hello sir, it's troublesome especially when you need to reindex a lot of documents.
I mean, for example, if you want to change the mapping of a certain field, or add a new field with analyzers, you have to reindex roughly 300 million+ documents just so you can use the index again.
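For reference, that workflow is the `_reindex` API: create a new index with the updated mapping, then copy the docs over. A hedged sketch (the index name `web_v2` and the `title` field are made up for illustration):

```
# New index with the changed mapping / analyzer
PUT /web_v2
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" }
    }
  }
}

# Copy documents across; run as a background task for large indices
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "web" },
  "dest":   { "index": "web_v2" }
}
```

With monthly indices, the same call can target just the months whose docs actually need the new mapping.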
Sure but wouldn't you do that for all docs, or only the latest ones? Reindexing is not very common for most people. Though if you do it for only some docs and fairly often, then sure, split them up and alias the whole set so you don't have to change your code, etc.
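The alias idea mentioned above can be sketched with the `_aliases` API: point one alias at all the monthly indices so queries keep using a single name. (This assumes the old monolithic index has been removed or renamed first, since an alias can't share a name with an existing index.)

```
POST /_aliases
{
  "actions": [
    { "add": { "index": "webjan2019", "alias": "web" } },
    { "add": { "index": "webfeb2019", "alias": "web" } }
  ]
}

# Searches against "web" now fan out over all months; application code is unchanged
GET /web/_search
```

Note that writes through a multi-index alias need one index flagged as `is_write_index`, or the app can write to the concrete monthly index directly.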
Our indices are always changing depending on what the client or the boss needs, so we're always reindexing them whenever there are changes... so yeah, it's kinda hard to reindex one BIG index that holds data going back to 2015...
So with your great expertise sir, would our concept be a good optimization or improvement of our infrastructure?
Well, I guess if you are always changing them, you'll need smaller indexes and thus can break them up. Though instead of re-indexing everything, I wonder if you could just index into other smaller temp indexes with the extra fields or whatever analyses you need; it depends a lot on how you use the data. Constantly changing indexing is not very common.