I have created an index of 200M docs that will be updated frequently; about 50M docs will be updated each month. In short, the index is both read-heavy and write-heavy.
As Elastic/Lucene says, an update is not done in place but as a delete followed by an add. That means the deleted docs still reside in the index but are no longer searchable.
Lucene occasionally merges segments according to the merge policy, which is costly.
So my questions are:
How will my index perform for reads and writes in such a scenario?
Is there any alternative way to deal with such a scenario?
Can we disable soft deletes in Elastic/Lucene and allow only hard deletes?
The only way to avoid the mark-as-deleted-then-merge-away behavior is to create an entirely new index and then drop the old one. Some people do this, but mostly because they don't have triggers to sync changes, so they must do it periodically.
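A rebuild like this is often paired with an index alias so that readers are repointed in one step before the old index is dropped. That pairing is my assumption, not something stated above. A minimal Python sketch that only builds the request body for Elasticsearch's `_aliases` endpoint; the alias and index names are hypothetical:

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in one request."""
    return {
        "actions": [
            # Both actions are applied together, so searches against the
            # alias see either the old index or the new one, never neither.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = alias_swap_body("master", "master_v1", "master_v2")
print(json.dumps(body, indent=2))
```

Once the swap has succeeded, the old index can be dropped with a normal delete-index request.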
I ran a system that had many updates, and performance was fine. The standard advice of setting the refresh interval to 30 seconds, if you can tolerate it, is good here. So is watching updates and deduplicating them.
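Roughly, the two suggestions could look like the sketch below. The settings body is what would be sent to `PUT /<index>/_settings`. The `dedupe_updates` helper and the `(doc_id, version, doc)` tuple shape are hypothetical; they only illustrate keeping the newest pending update per document before sending a bulk request:

```python
# Body for PUT /<index>/_settings: refresh every 30 seconds.
REFRESH_SETTINGS = {"index": {"refresh_interval": "30s"}}


def dedupe_updates(updates):
    """Keep only the latest update per document id.

    `updates` is an iterable of (doc_id, version, doc) tuples. A higher
    version wins; on a tie, the update seen later in the iterable wins.
    """
    latest = {}
    for doc_id, version, doc in updates:
        current = latest.get(doc_id)
        if current is None or version >= current[0]:
            latest[doc_id] = (version, doc)
    return [(doc_id, v, d) for doc_id, (v, d) in latest.items()]


# Hypothetical pending updates collected between two bulk requests.
pending = [
    ("a", 1, {"x": 1}),
    ("b", 1, {"x": 2}),
    ("a", 2, {"x": 3}),
]
deduped = dedupe_updates(pending)
```

Since every update to the same document leaves another deleted copy behind for a later merge, collapsing duplicates before indexing reduces that work.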
We are building a system that takes updates from multiple sources and applies them to the master data. At the same time, consumers of the master data should be able to get the updated data in real time.
I did a POC using the following configuration:
3-node cluster
RAM: 60 GB
HD: 1.5 TB SSD
CPU: 8 cores
POC:
Indexing and searching (fuzzy queries) simultaneously in a multi-threaded environment.
What I observed is that indexing slows down drastically. If I do only indexing, the performance is very good.
What was limiting performance when you indexed and searched at the same time? CPU? Disk I/O? Did you see any reports in the logs about long or slow GC?
Indexing, merging and querying use the same system resources, so of course they can affect each other's performance. You need to test that you have enough system resources available for the combined load and that, under that load, you can still index and query with acceptable performance.