Indexing performance over index size

After running quite a lot of experiments on a 5-node ES cluster, I
noticed that indexing performance is heavily affected by index size.

https://lh6.googleusercontent.com/-G24uCWlCxPc/UfeCFJCmm5I/AAAAAAAAABM/r5Bxda-XkXo/s1600/ES_slope.png

So when I performed bulk indexing into some empty indexes, this was the
performance I got (see the image above).

If I understand correctly, performance is not affected by other large indexes
that exist on the cluster but are not being indexed into at the time.

So, is it best to index into a new index every day or week, using aliases to
manage them?

Given that, at the same time, I want to keep running search queries on all the
indexes (current and past ones), does the above make sense to do?

For how long did you perform the indexing? What JVM, and what ES settings?

In my experiments, I can index for hours, and indexing throughput scales
well with the number of nodes (where the remote feed client is mostly
network- and CPU-bound).

Maybe what you measured was Lucene segment merging activity? That is what takes
additional time as an index grows, compared to an empty index, not the index
size itself.
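You can check that with the indices stats API, which reports cumulative merge
counts and merge time per index. A small sketch with the Python client
(elasticsearch-py), assuming a local node and a made-up index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Fetch only the merge statistics for one index.
stats = es.indices.stats(index="bulk-test", metric="merge")
merges = stats["indices"]["bulk-test"]["total"]["merges"]

# If total_time_in_millis keeps climbing while throughput drops, the slowdown
# comes from merge activity rather than from the raw size of the index.
print("merges: %d, time spent merging: %d ms"
      % (merges["total"], merges["total_time_in_millis"]))
```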

Jörg

On a 5-node cluster (4 cores, 4 GB RAM per node), we had 4 node clients
continuously bulk indexing data for 2-3 days. The biggest change is in the
first 10 hours or so, where performance declines rapidly.
Two of the applications did bulk indexing into separate indexes, while the
other two indexed into the same index. In the first 10 hours we had indexed
25 GB, and we had about 45 GB of index in total at the end of our experiment.
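Each indexing client does something roughly equivalent to this simplified
sketch (elasticsearch-py bulk helpers; the index name and documents are
placeholders, not our real feed code):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def generate_docs(n=100000):
    # Placeholder documents; the real clients feed application data.
    for i in range(n):
        yield {
            "_index": "bulk-test",   # made-up index name
            "_type": "doc",          # mapping types were still required back then
            "_source": {"id": i, "body": "payload %d" % i},
        }

# helpers.bulk batches the generator into _bulk requests of chunk_size docs.
success, errors = helpers.bulk(es, generate_docs(), chunk_size=1000)
print("indexed %d documents" % success)
```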

Did you use custom settings for index refresh, segment merging, or JVM GC?

The default values
(http://www.elasticsearch.org/guide/reference/index-modules/merge/) are a bit
tight for a 2 GB heap and long-running bulk indexing. I would suggest 1 GB for
max_merged_segment and 20 for segments_per_tier for more consistent performance
on your configuration, together with disabling index refresh and tuning the
JVM GC.
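Applied to an existing index, that would look roughly like this with the Python
client (a sketch; `bulk-test` is a made-up index name, and the GC tuning itself
is done with JVM flags rather than through the ES API):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Relax the tiered merge policy and switch off the periodic refresh
# for the duration of the bulk load.
es.indices.put_settings(
    index="bulk-test",
    body={
        "index.merge.policy.max_merged_segment": "1gb",
        "index.merge.policy.segments_per_tier": 20,
        "index.refresh_interval": "-1",
    },
)

# Once the bulk load is finished, turn refresh back on.
es.indices.put_settings(index="bulk-test", body={"index.refresh_interval": "1s"})
```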

Jörg
