When I bulk indexed into some empty indexes, this was the performance I
got (see the image above).
If I understand correctly, indexing performance is not affected by other
large indexes that exist on the cluster but are not being indexed into at
the time.
If so, is it best to index into a new index every day or week and use
aliases to manage them?
Given that I also want to keep running search queries across all the
indexes (current and past ones), does this approach make sense?
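For illustration, the daily-index-plus-alias idea could be sketched roughly
as below. This is only a sketch under assumptions: the `logs` prefix, the
`logs-all` alias name and the helper names are hypothetical, and the code
only builds the request body for the `_aliases` endpoint rather than talking
to a cluster.

```python
import json
from datetime import date, timedelta


def daily_index_name(prefix, day):
    """Build a per-day index name such as 'logs-2013.05.07' (naming is an assumption)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def alias_add_body(alias, index_names):
    """Build an _aliases request body that adds every given index to one alias."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in index_names]}


if __name__ == "__main__":
    today = date.today()
    # Hypothetical: the last three daily indexes, all searchable through one alias.
    names = [daily_index_name("logs", today - timedelta(days=n)) for n in range(3)]
    # This JSON would be POSTed to the cluster's /_aliases endpoint.
    print(json.dumps(alias_add_body("logs-all", names), indent=2))
```

Bulk requests would then target only the current day's index, while searches
go through the alias and so cover current and past indexes.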
How long did the indexing run? What VM and what ES settings did you use?
In my experiments, I can index for hours, and the index throughput scales
well with the number of nodes (where the remote feed client is mostly
network and CPU bound).
Maybe you measured Lucene segment merging activity? That is what takes
additional time as an index grows, compared to an empty index, not the
index size itself.
On a 5-node cluster (4 cores, 4G RAM per node), we had 4 node clients
continuously bulk indexing data for 2-3 days. The biggest change is in the
first 10 hours or so, where performance declines rapidly.
Two of the applications bulk indexed into separate indexes, while the other
two indexed into the same index. In the first 10 hours we had indexed 25G,
and about 45 in total by the end of our experiment.
Did you use custom settings for index refresh, segment merging, or JVM GC?
The default merge values
(http://www.elasticsearch.org/guide/reference/index-modules/merge/) are a
bit demanding for a 2G heap and long-running bulk indexing. For more
consistent performance on your configuration, I would suggest 1G for
max_merged_segment and 20 for segments_per_tier, together with disabled
index refresh and GC tuning.
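As a rough sketch, the suggested values could be expressed as an index
settings body like the one below. The helper names are hypothetical, and
whether each setting can be changed on a live index depends on the ES
version, so treat this as an illustration rather than a tested recipe.

```python
import json


def bulk_indexing_settings(max_merged_segment="1gb", segments_per_tier=20):
    """Settings body for a long bulk load: refresh disabled, merge policy adjusted."""
    return {
        # -1 disables the periodic index refresh while bulk indexing runs.
        "index.refresh_interval": "-1",
        "index.merge.policy.max_merged_segment": max_merged_segment,
        "index.merge.policy.segments_per_tier": segments_per_tier,
    }


def restore_refresh_settings(refresh_interval="1s"):
    """Settings body to re-enable refresh once the bulk load has finished."""
    return {"index.refresh_interval": refresh_interval}


if __name__ == "__main__":
    # These bodies would be sent to the index's _settings endpoint.
    print(json.dumps(bulk_indexing_settings(), indent=2))
    print(json.dumps(restore_refresh_settings(), indent=2))
```

Refresh is turned back on afterwards so that newly indexed documents become
visible to searches again.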