Merge/segment understanding

Han_JU · March 28, 2014, 2:07pm

Hi,

We want to understand how segments are created during bulk indexing.
Say we've set the following parameters:

"index.translog.flush_threshold_ops" : "50000",
"index.translog.flush_threshold_size": "300mb",

As we understand it, this means ES will not flush until it has received
50000 operations (index operations in this case), so 50000 documents are
always flushed/committed to Lucene at a time. We therefore expected that
Lucene would not create segments with fewer than 50000 documents.
But in our benchmark with these settings, we found lots of segments with,
say, ~3000 documents, each far smaller than 300mb (the flush threshold).
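
One way to check per-segment document counts like the ~3000 figure above is the indices segments API (`GET /<index>/_segments`). The sketch below only post-processes an already-parsed response. The function name `small_segments`, the index name `benchmark_index`, and the sample payload are hypothetical, and the nested `indices` / `shards` / `segments` / `num_docs` layout is assumed from the 1.x response format, so verify it against a real response before relying on it.

```python
def small_segments(segments_response, min_docs):
    """Return (index, shard, segment, num_docs) for segments below min_docs."""
    found = []
    for index_name, index_data in segments_response.get("indices", {}).items():
        for shard_id, copies in index_data.get("shards", {}).items():
            # Each shard id maps to a list of shard copies (primary and replicas).
            for copy in copies:
                for seg_name, seg in copy.get("segments", {}).items():
                    if seg["num_docs"] < min_docs:
                        found.append((index_name, shard_id, seg_name, seg["num_docs"]))
    return sorted(found)


# Hypothetical, hand-written sample in the assumed response shape.
sample = {
    "indices": {
        "benchmark_index": {
            "shards": {
                "0": [
                    {
                        "segments": {
                            "_0": {"num_docs": 50000, "size_in_bytes": 310000000},
                            "_1": {"num_docs": 3000, "size_in_bytes": 18000000},
                        }
                    }
                ]
            }
        }
    }
}

for row in small_segments(sample, min_docs=50000):
    print(row)
```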

My questions are:

1. How do these small segments get generated, given that we flush 50000
   documents at a time?
2. Does avoiding small segments help indexing speed and merge speed?

We are using ElasticSearch v1.0.1 and we also set these when benchmarking:

{
  "index": {
    "merge.policy.max_merge_at_once": "999",
    "merge.policy.segments_per_tier": "999",
    "refresh_interval": "-1"
  }
}

Thanks.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/21ec20de-0a6d-4196-abf3-d12287544b7f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Binh_Ly_2 · March 28, 2014, 7:32pm

The indexing buffer can also fill up, and when it does its contents are
written out as a new segment. Also, the translog flush is not "exactly"
deterministic: for example, "index.translog.interval" determines how often
ES checks whether the translog needs to be flushed. Anyway, I wouldn't
worry about it if I were you. As for the merge settings, I'd probably leave
the defaults alone unless you are absolutely sure changing them helps you.
The more segments there are, the more time a merge could take.
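
To make the first point concrete, here is a deliberately simplified toy model, not how Elasticsearch or Lucene is actually implemented. It assumes a segment is written whenever either the in-memory buffer fills or the translog operation threshold is reached. The function name and all the numbers (document size, buffer size) are hypothetical, and the model ignores indexing threads, shards, and the periodic translog check mentioned above.

```python
def simulate_segments(num_docs, doc_bytes, buffer_bytes, flush_threshold_ops):
    """Toy model: return the document count of each segment written."""
    segments = []
    buffered_docs = 0    # documents currently held in the indexing buffer
    ops_since_flush = 0  # operations recorded since the last translog flush
    for _ in range(num_docs):
        buffered_docs += 1
        ops_since_flush += 1
        buffer_full = buffered_docs * doc_bytes >= buffer_bytes
        translog_full = ops_since_flush >= flush_threshold_ops
        if buffer_full or translog_full:
            # Either trigger writes the buffered documents out as one segment.
            segments.append(buffered_docs)
            buffered_docs = 0
        if translog_full:
            ops_since_flush = 0
    if buffered_docs:
        segments.append(buffered_docs)  # whatever is left at the end
    return segments


# Hypothetical numbers: 10 KB documents and a 30 MB buffer.
sizes = simulate_segments(num_docs=100_000, doc_bytes=10_000,
                          buffer_bytes=30_000_000, flush_threshold_ops=50_000)
print(len(sizes), max(sizes), min(sizes))
```

With these made-up numbers the buffer fills every 3000 documents, long before 50000 operations accumulate, so in this model no segment ever comes close to 50000 documents.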


Han_JU · March 31, 2014, 10:03am

Thanks Binh.

I'm curious about this because we're benchmarking our bulk indexing, and
we've found the fastest bulk indexing strategy to be:

1. Bulk index with 0 replicas and no refresh, letting ES do as little
   merging as possible.
2. When indexing has finished, optimize the segments.
3. Replicate.
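
The steps above can be sketched as a sequence of REST calls. This is only an outline under assumptions: the helper name `bulk_load_plan`, the index name `benchmark_index`, the final replica count, and `max_num_segments=1` are illustrative choices that the thread does not specify. It uses the update-settings endpoint and the 1.x `_optimize` endpoint (later versions renamed it), and it only builds the requests rather than sending them. The bulk indexing itself would happen between the first and second call.

```python
import json


def bulk_load_plan(index, final_replicas=1):
    """Return the strategy as a list of (method, path, body) tuples."""
    settings_path = "/%s/_settings" % index
    return [
        # Before indexing: no replicas, refresh disabled.
        ("PUT", settings_path,
         {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}),
        # After indexing: merge segments down (target of 1 is an assumption).
        ("POST", "/%s/_optimize?max_num_segments=1" % index, None),
        # Finally: bring replicas back so the optimized shards get copied.
        ("PUT", settings_path,
         {"index": {"number_of_replicas": final_replicas}}),
    ]


for method, path, body in bulk_load_plan("benchmark_index"):
    print(method, path, json.dumps(body) if body else "")
```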

Are there any readings on the details/internals of Lucene? We have the
book Lucene in Action, but it's mainly about core concepts and usage.


