I need to bulk index a large number of documents, starting with an empty index. I know I can do a (long-running) _forcemerge later to reduce the number of segments to 1, but I wonder if there is a setting so I can get an index with a single segment without this step?
I am indexing on a separate server and I want to get an index fully optimized for fast searches.
Elasticsearch assumes that you're going to continue to index documents and therefore does not merge segments this enthusiastically. Over-merging (e.g. merging to a single segment on a shard that is still indexing documents) can cause performance issues, so Elasticsearch won't do it unless you specifically ask it to.
Why do you want to avoid a final _forcemerge after your indexing has finished?
Because it takes a very long time, and I know from the very beginning that I am starting with an empty index, adding ~100 million docs, and then using the index in a read-only manner.
I want to get a single-segment index as fast as possible.
I know of no easy way around having Elasticsearch write its data in multiple segments and then merge them together later.
If your documents do not fit into memory (specifically the indexing buffer) then Elasticsearch needs to write a segment each time this buffer fills up.
If your indexing generates a translog larger than index.translog.flush_threshold_size then Elasticsearch will perform a flush each time this threshold is reached.
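As a concrete sketch of the second point (using the Python client; the index name and threshold value here are illustrative placeholders, not recommendations), you can raise that threshold with a dynamic settings update. The indexing buffer, by contrast, is a node-level setting (indices.memory.index_buffer_size in elasticsearch.yml) and cannot be changed through this API:

```python
# Sketch: raise the translog flush threshold so fewer flushes (and hence
# fewer small segments) occur during the bulk load. "my-index" and "2gb"
# are placeholders. Older clients take body=; newer ones also accept settings=.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="my-index",
    body={"index.translog.flush_threshold_size": "2gb"},
)
```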
You should be sure to follow the instructions on tuning for indexing speed, since this will help to generate fewer segments. In particular, ensure that you are not refreshing too frequently.
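For reference, the usual bulk-load pattern from that guide looks roughly like the following minimal sketch (the index name, document generator, and restored setting values are placeholders):

```python
# Sketch of the bulk-load pattern from the "tune for indexing speed" docs:
# disable refresh and replicas while loading, then restore them afterwards.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Turn off periodic refresh and replicas for the duration of the load.
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

def actions():
    # Placeholder document generator; replace with your real source.
    for i in range(1000):
        yield {"_index": "my-index", "_source": {"id": i}}

helpers.bulk(es, actions())

# Restore normal settings once the load is done.
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
```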
Are you sure that the final force merge is actually worth it? You seem to be trying to optimise the process of building a shard from scratch, which implies that you will be doing it quite often. How much extra search performance does it buy you?
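If the merge does turn out to be worth it, the documented pattern for an index that will only be searched from then on is to block writes first and then force merge down to a single segment. A minimal sketch, again with a placeholder index name:

```python
# Sketch: once indexing is finished, mark the index read-only and merge
# each shard down to a single segment. "my-index" is a placeholder.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Block writes first; force-merging to one segment is only recommended
# on indices that will no longer be written to.
es.indices.add_block(index="my-index", block="write")

# Long-running call; merges each shard down to a single segment.
es.indices.forcemerge(index="my-index", max_num_segments=1)
```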