In our current setup, we're scraping a site every day and indexing the results, basically overwriting the same data over and over again, every 24 hours.
Since I know the index is going to be read-only for 24 hours and I can see a large number (100+) of segments with a lot of deleted docs, I thought forcemerge might benefit us after every scrape is done.
But ... the documentation states: Running force merge against a read-write index can cause very large segments to be produced (>5Gb per segment), and the merge policy will never consider it for merging again until it mostly consists of deleted docs. This can cause very large segments to remain in the shards.
Does this mean that my indices are now excluded from automatic merging since I called forcemerge myself? Or does that only happen if the index was accidentally written to while merging? Is forcemerge setting a flag that basically says "don't touch" for automatic merging? If so, what would I need to do to get the index back to its original state? I can't seem to figure out whether calling forcemerge once causes a lasting change to the index that means I'll have to manage it myself forever.
Lots of questions here, sorry, but I can't seem to find the answers anywhere and the documentation seems to be missing some explanations.
No. Elasticsearch still checks the index continuously to see whether it needs merging. But if you run a force merge and it results in a very large Lucene segment (I believe this description is still good), that segment will normally not be eligible for a future merge, since automatic merging works by joining several smaller segments into one larger segment, leaving out the deleted documents from the merged segments in the process.
Thus, if your force merge results in a 5GB or larger segment, that segment may live for a very long time. If you then overwrite documents residing in that segment, those deleted documents will not be erased from the index, because the large segment doesn't get merged again until it consists mostly of deleted docs.
In your use case of overwriting documents every 24 hours, I don't think force merge is a good solution.
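If you want to check how much of each segment consists of deleted documents, something along these lines might help. It is only a sketch: it assumes the rows come from `GET /_cat/segments/<index>?format=json` (which reports `docs.count` and `docs.deleted` per segment as strings), and the function name and sample rows are made up for illustration.

```python
def deleted_ratio_per_segment(rows):
    """Map (shard, segment) to the fraction of its docs that are deleted.

    `rows` is assumed to be the parsed JSON list returned by
    GET /_cat/segments/<index>?format=json.
    """
    ratios = {}
    for row in rows:
        live = int(row["docs.count"])
        deleted = int(row["docs.deleted"])
        total = live + deleted
        # Guard against empty segments to avoid dividing by zero.
        ratios[(row["shard"], row["segment"])] = deleted / total if total else 0.0
    return ratios


# Hypothetical sample rows, not real output from any cluster.
sample_rows = [
    {"shard": "0", "segment": "_a", "docs.count": "900", "docs.deleted": "100"},
    {"shard": "0", "segment": "_b", "docs.count": "250", "docs.deleted": "750"},
]
ratios = deleted_ratio_per_segment(sample_rows)
for key, ratio in sorted(ratios.items()):
    print(key, f"{ratio:.0%} deleted")
```

A large segment with a steadily growing deleted fraction is the situation described above.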
Instead I would suggest two alternatives:
Reindexing to a fresh index and replacing the old index once done.
Creating new indices every day or week.
The first alternative relies on the Reindex API and lets you decide when and where to get rid of the deleted documents by running a reindex job into a fresh index. By using an index alias you can hide the specific index name from the clients, so that nobody will notice when, once the reindexing has finished, you switch the alias to point at the new index and then delete the old one.
The second alternative may be a bit more cumbersome, but it depends on how you do the web scraping. If it's done in a short burst, once a day, I would simply point it to a new index every time and just drop the old one when the new one is complete (here an index alias would also be useful). However, if you run the web scraping continuously and index the documents as you go, you may have to update two indices for as long as it takes to complete the new one before deleting the old. In both cases you don't have to worry about the deleted documents: in the first case because there aren't any, and in the latter because the old index, with the overwritten documents, will be deleted anyway.
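As a rough sketch of the first alternative, these are the two request bodies involved: one for `POST /_reindex` and one for `POST /_aliases`. The index names `scrape-v1` and `scrape-v2`, the alias `scrape` and the helper function names are assumptions for illustration only.

```python
import json


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy the live docs into a fresh index."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: move the alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names; replace with your own index and alias.
reindex_request = reindex_body("scrape-v1", "scrape-v2")
swap_request = alias_swap_body("scrape", "scrape-v1", "scrape-v2")

print("POST /_reindex")
print(json.dumps(reindex_request, indent=2))
print("POST /_aliases")
print(json.dumps(swap_request, indent=2))
```

Both alias actions go in the same request so that clients querying `scrape` never see a moment where the alias points at nothing. Once the swap has gone through, the old index can be removed with `DELETE /scrape-v1`.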
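For the second alternative, a small helper could generate the name of each day's index and work out which older indices can be dropped once the new one is complete. This is a sketch under assumptions: the `scrape` prefix, the date-suffix naming scheme and both function names are hypothetical, not anything Elasticsearch requires.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Build a dated index name such as 'scrape-2024.01.02'."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_to_drop(existing, prefix, keep):
    """Return the older indices sharing the prefix, excluding the one to keep."""
    return sorted(
        name for name in existing
        if name.startswith(prefix + "-") and name != keep
    )


# Hypothetical example: today's scrape has just finished.
new_index = daily_index_name("scrape", date(2024, 1, 2))
existing = ["scrape-2024.01.01", "scrape-2024.01.02", "other-index"]
stale = indices_to_drop(existing, "scrape", new_index)

print("new index:", new_index)
print("drop after switching the alias:", stale)
```

The scraper writes to `new_index`, the alias is switched once the run is complete, and only then are the indices in `stale` deleted.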
Wow, that is the best explanation. It now makes complete sense - thank you!
Currently, storage is not that big of an issue, but performance gains are always welcome.
With option two, after scraping is done, and before switching the alias, would I gain any noticeable query performance by merging 150 segments down to 1 on a 5GB index?
Yes, I have experienced improved performance after a force merge.
I haven't tried a force merge down to just 1 segment though, only down to 5. I had an old index with several million deleted documents and noticed that each shard had many hundreds of segments, most of them fairly small. So I did a force merge, setting max_num_segments=5, and in a short test I ran afterwards the search times were reduced by about 30%.
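For reference, a force merge like the one described is a single `POST /<index>/_forcemerge?max_num_segments=N` call. The sketch below builds that URL and, optionally, sends it using only the standard library. The base URL `http://localhost:9200`, the index name and both function names are assumptions for illustration, and the sketch ignores authentication and TLS entirely.

```python
import urllib.request
from urllib.parse import quote, urlencode


def forcemerge_url(base_url, index, max_num_segments):
    """Build the URL for POST /<index>/_forcemerge?max_num_segments=N."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"


def run_forcemerge(base_url, index, max_num_segments):
    """Send the force merge request and return the HTTP status code.

    The call blocks until the merge has finished, which can take a while
    on a large index. Only run it against an index that is no longer
    being written to.
    """
    request = urllib.request.Request(
        forcemerge_url(base_url, index, max_num_segments), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical cluster address and index name; nothing is sent here.
url = forcemerge_url("http://localhost:9200", "scrape-2024.01.02", 5)
print("POST", url)
```

Whether merging down to 5 segments or to 1 pays off on a 5GB index is something to measure on your own queries. The 30% figure above comes from one short test on one index.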