I created an index in a 4-node Elasticsearch cluster. I added about 3.5 M
documents using the Java Elasticsearch API.
When asking for the stats, I get a very high number in
throttle_time_in_millis as follows:
When you add documents to Elasticsearch, this creates new files on disk
that form what is called a segment. Having several segments is fine, but
when you start having too many of them, search gets slower. This is why
Elasticsearch has a background process that takes care of merging these
segments, so that the total number of segments remains low enough
(usually around 50 per shard). However, running a background merge can
take a lot of resources on the server, especially I/O, and this might
defeat the purpose of keeping search fast, since search operations would
not have much I/O capacity left. To prevent this from happening, merges
are throttled[1], meaning that they can't write more than X bytes of data
per second. If they try to, Elasticsearch pauses them for a while before
they can resume merging.
The throttle_time reported by the stats API gives you the total amount of
time that merges have been paused in order to prevent them from stealing
all the server I/O.
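To make the pausing behaviour concrete, here is a minimal Java sketch of a
write-rate throttle. It is a simplified model, not the code Elasticsearch
or Lucene actually runs: the class name MergeThrottleSketch, the method
pauseFor and the 20 MB/s limit used in the usage note are all assumptions
made up for illustration.

```java
/** Simplified model of merge throttling; NOT Elasticsearch's real implementation. */
final class MergeThrottleSketch {
    // Hypothetical limit, the "X bytes per second" from the explanation above.
    private final double maxBytesPerSec;
    // Accumulated pause time, analogous in spirit to throttle_time_in_millis.
    private long throttleTimeMillis = 0;

    MergeThrottleSketch(double maxBytesPerSec) {
        if (maxBytesPerSec <= 0) {
            throw new IllegalArgumentException("maxBytesPerSec must be positive");
        }
        this.maxBytesPerSec = maxBytesPerSec;
    }

    /**
     * A merge reports that it wrote bytesWritten bytes in elapsedMillis ms.
     * Returns how long it would have to pause so that its average write
     * rate stays at or below the limit, and adds that pause to the total.
     */
    long pauseFor(long bytesWritten, long elapsedMillis) {
        // Minimum time the write is allowed to take at the configured rate.
        long minMillis = (long) Math.ceil(bytesWritten / maxBytesPerSec * 1000.0);
        long pause = Math.max(0, minMillis - elapsedMillis);
        throttleTimeMillis += pause;
        return pause;
    }

    long throttleTimeMillis() {
        return throttleTimeMillis;
    }
}
```

With an assumed limit of 20 MB/s, a merge that writes 40 MB in 500 ms would
be paused for the remaining 1500 ms, while one that writes 10 MB in a full
second would not be paused at all. A large accumulated total therefore
mostly reflects how much merging went on while indexing.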
Is a segment a single file with multiple documents? Or is it multiple files
that together form a segment? In other words, I don't fully understand why
the notion of a segment exists.
Does the fact that I have a high number in the throttling KPI mean that I
have a performance problem, and if so, is there a setting to tune it
properly?
On Tuesday, March 4, 2014 3:01:17 PM UTC, Isaac Hazan wrote:
Thx.
Is a segment a single file with multiple documents? Or is it multiple
files that together form a segment? In other words, I don't fully
understand why the notion of a segment exists.
The simple answer is that a segment is made of several files. Typically,
there is one that stores "stored fields" (which let you retrieve the
original field values given a document ID), one for the terms dictionary
(the unique terms in your documents), one for postings lists (which, given
a term, return the list of documents that contain this term), one for
deleted documents, etc.
And an index is the union of several segments. Searching an index is
effectively searching every segment and merging results together.
For your information, there is also an optimization called "compound file"
which stores all these logical files of one segment in a single physical
file when the segment is small. This helps save file descriptors.
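The idea that an index is the union of its segments, and that a search
visits every segment and merges the results, can be sketched in a few lines
of Java. This is a toy model under stated assumptions: the names
SegmentSearchSketch and Segment are hypothetical, each segment holds only an
in-memory postings map, and document IDs are treated as global, which is a
simplification of how real segments number their documents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy model of "index = union of segments"; not Lucene's real data structures. */
final class SegmentSearchSketch {

    /** One segment: a postings map from term to the IDs of documents containing it. */
    static final class Segment {
        private final Map<String, List<Integer>> postings;

        Segment(Map<String, List<Integer>> postings) {
            this.postings = postings;
        }

        /** Returns the postings list for the term, or an empty list if absent. */
        List<Integer> search(String term) {
            return postings.getOrDefault(term, Collections.emptyList());
        }
    }

    /** Searches every segment for the term and merges the hits into one sorted list. */
    static List<Integer> search(List<Segment> index, String term) {
        List<Integer> hits = new ArrayList<>();
        for (Segment segment : index) {
            hits.addAll(segment.search(term));
        }
        Collections.sort(hits);
        return hits;
    }
}
```

Each extra segment adds one more lookup per search in this model, which is
the intuition behind keeping the number of segments low through merging.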
Does the fact that I have a high number in the throttling KPI mean that
I have a performance problem, and if so, is there a setting to tune it
properly?
A high throttling time is not necessarily an issue; it just means that
merges have occasionally been paused so that search remains fast. If you
want, you can disable merge throttling by setting
index.store.throttle.max_bytes_per_sec[1] to -1.
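As a rough sketch of how that setting could be applied, the Java helper
below only builds the path and JSON body for an index settings update
(PUT /<index>/_settings); it does not send anything. The class and method
names and the index name "my_index" are hypothetical, and whether your
Elasticsearch version accepts this setting as a live update is an
assumption you should check against its documentation.

```java
/** Builds (but does not send) a settings-update request; names are illustrative. */
final class ThrottleSettingsSketch {

    /** Path of the index settings endpoint for the given index name. */
    static String settingsPath(String indexName) {
        return "/" + indexName + "/_settings";
    }

    /**
     * JSON body that sets index.store.throttle.max_bytes_per_sec.
     * Pass "-1" to disable throttling, as suggested in the reply above.
     */
    static String settingsBody(String maxBytesPerSec) {
        return "{\"index.store.throttle.max_bytes_per_sec\": \""
                + maxBytesPerSec + "\"}";
    }
}
```

For example, settingsPath("my_index") and settingsBody("-1") give the path
and body you would send with an HTTP PUT from any client.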