I have some old indexes that are no longer being updated. They are read-only, and I'm looking to optimize their segments for the best search performance.
I have heard conflicting arguments. Some argue that you should have as many segments as CPUs, because searching segments is a multi-threaded operation: it costs more CPU and memory, but it is faster.
Others argue that a read-only index should have only one segment, because that reduces the unique terms and memory cost, and it takes one search instead of, say, 8 on an 8-CPU system to get a result back.
For Lucene 3.5.0 (which ES is on), and considering that ES is already sharded (e.g. 16 shards), does it make sense to have just 1 segment?
For example, if my index has 16 shards and each shard has 8 segments, searching that one index takes 128 search operations. If each shard had one segment, it would take 16 search operations. Which one provides the best performance?
Search on a single Lucene index across segments is not parallelized (you
can do it in Lucene, but I find little use for it in a system like
elasticsearch). So, yes, going down to fewer segments will reduce the index
size and memory usage, and improve performance. Note that going down to a
single segment means rewriting the whole shard (and all shards in that
index), which can be expensive.
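
To make the numbers in the question concrete, here is a minimal Python sketch. It assumes the optimize API of Elasticsearch releases from this era (`POST /{index}/_optimize?max_num_segments=N`); later releases renamed this endpoint to `_forcemerge`, so check the documentation for your version. The host `http://localhost:9200`, the index name `old-logs`, and both helper names are hypothetical. The request is only built, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def segment_searches(shards: int, segments_per_shard: int) -> int:
    """Per-segment searches a single query touches across one index."""
    return shards * segments_per_shard


def build_optimize_request(base_url: str, index: str,
                           max_num_segments: int = 1) -> Request:
    """Build (but do not send) a POST to the index's _optimize endpoint.

    Assumption: the cluster exposes _optimize with max_num_segments,
    as older Elasticsearch releases did.
    """
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be >= 1")
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_optimize?{query}"
    # An empty body plus an explicit method makes this a POST.
    return Request(url, data=b"", method="POST")


if __name__ == "__main__":
    print(segment_searches(16, 8))  # 16 shards x 8 segments
    print(segment_searches(16, 1))  # 16 shards x 1 segment
    req = build_optimize_request("http://localhost:9200", "old-logs")
    print(req.get_method(), req.full_url)
    # Sending it with urllib.request.urlopen(req) would rewrite every
    # shard of the index, which is the expensive step noted above.
```

Since the merge rewrites each shard in full, it is probably best run during a quiet period and only on indexes that really are read-only.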
Is the answer in this old thread still true? Specifically, is it true that "a single Lucene index across segments is not parallelized"?
Also, related question: Is it always better to merge all segments down to a single segment if you are willing to wait for the merge? Are there some cases where multiple segments are desirable? The documentation suggests using five segments, but does not say why.