When we index our data from our database (using the Hibernate Search Elasticsearch implementation), we end up with a large number of segments (20-30). On our current data (2 indexes of text records of about 10 KB each, 1m records in one index and 100k in the other, both with 1 shard and 2 replicas), we found that force-merging could improve search performance. It isn't strictly necessary at that scale, though, because the data is small.
We are currently scaling our data up towards 25m records in the first index and 2m in the second. We have split the first index into 2 shards. Search latency for even a basic match_all query has grown almost linearly with the data size, from 100 ms to 2000 ms on the first index and from 20 ms to 200 ms on the second. More complex queries take far longer. The segment count is also higher: around 40-50 segments after mass indexing, even after leaving the index for a few days to merge automatically.
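For reference, here is a minimal sketch of how segment counts like the ones above could be gathered, assuming the `_cat/segments` API is called with `format=json`. The function names, the base URL, and the index name are hypothetical. Only primary shard copies are counted, since the API returns one row per segment per shard copy.

```python
import json
from collections import Counter
from urllib.request import urlopen


def segments_per_shard(rows):
    """Count segments per (index, shard), primaries only.

    `rows` is the parsed JSON from GET /_cat/segments/<index>?format=json,
    where each row describes one segment of one shard copy.
    """
    counts = Counter()
    for row in rows:
        if row.get("prirep") == "p":  # skip replica copies
            counts[(row["index"], row["shard"])] += 1
    return dict(counts)


def fetch_segment_rows(base_url, index):
    """Fetch the raw rows from a live cluster (base_url is an assumption)."""
    with urlopen(f"{base_url}/_cat/segments/{index}?format=json") as resp:
        return json.load(resp)
```

Against a running cluster this would be called as `segments_per_shard(fetch_segment_rows("http://localhost:9200", "records"))`.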
Can we tune the merge policy to be more aggressive with its merges? If so, will this improve search performance? I am currently running a force merge to see whether manually reducing the segment count helps, but it is taking a long time to complete. Once it is done I can report the query times with the reduced segment count.
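To make the question concrete, this is roughly the kind of settings change I have in mind. It is a sketch only: it assumes the `index.merge.policy.*` index settings (such as `segments_per_tier` and `max_merge_at_once`) are available and dynamically updatable in the Elasticsearch version in use, which should be checked first. The function name, URL, index name, and values are placeholders, not recommendations.

```python
import json
from urllib.request import Request


def merge_policy_request(base_url, index, **policy):
    """Build (but do not send) a PUT /<index>/_settings request.

    Each keyword argument becomes an `index.merge.policy.<name>` setting.
    """
    body = {f"index.merge.policy.{name}": value for name, value in policy.items()}
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values: fewer segments allowed per tier means more merging.
req = merge_policy_request(
    "http://localhost:9200", "records",
    segments_per_tier=5, max_merge_at_once=5,
)
```

The request is only constructed here; passing `req` to `urllib.request.urlopen` would apply it to a live cluster.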
Some specifics about our scaled data:
Mass indexed from the DB once a week. This has to be done weekly to pick up database changes that are not mapped to the index. It means the whole index is recreated, which takes us back to the higher segment count.
Updated hourly, with about 2-5k records updated in each index.
Most records are updated several times in their lifetime, with more recent records having a higher chance of being updated again.
Any help with tuning this to automatically merge, or other ways to increase search performance would be greatly appreciated!
EDIT: I forgot to mention that my concern with force-merging data this large is that the resulting segments will be over 5GB in size and so won't be merged again. However, I am unsure whether this will be an issue if we rebuild every weekend anyway. Also, any way to make this automatic, rather than having someone run the query by hand, is preferred: once this moves into a production environment, having someone press the button once a week isn't ideal.
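As an illustration of what "automatic" could look like, here is a minimal sketch of a script that a scheduler (cron, for example) might run right after the weekly mass indexing finishes. It assumes the `_forcemerge` API with the `max_num_segments` parameter; the function names, base URL, and index names are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def force_merge_request(base_url, index, max_num_segments=1):
    """Build a POST /<index>/_forcemerge?max_num_segments=N request."""
    query = urlencode({"max_num_segments": max_num_segments})
    return Request(
        url=f"{base_url}/{quote(index)}/_forcemerge?{query}",
        method="POST",
    )


def force_merge_all(base_url, indexes, max_num_segments=1, send=urlopen):
    """Force-merge each index in turn and return the HTTP status per index.

    `send` defaults to urlopen and can be swapped out for testing. The call
    blocks until each merge finishes, which can take a long time.
    """
    statuses = {}
    for index in indexes:
        request = force_merge_request(base_url, index, max_num_segments)
        with send(request) as resp:
            statuses[index] = resp.status
    return statuses
```

A scheduled job would then call something like `force_merge_all("http://localhost:9200", ["records", "notes"])` once the rebuild is complete.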