Merge policy tuning hint

Hi,
I think I am seeing node performance degrade when the merge count ("merges": {"current": x, ...} in node stats) increases to somewhere between 6 and 10, and that results in the node dropping from the cluster.

I enabled index.merge DEBUG logging, and then I see the following logs for each shard:

[2012-12-12 18:29:44,136][DEBUG][index.merge.policy ] [Scourge of the Underworld] [i11][25] using [tiered] merge policy with expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10], max_merge_at_once_explicit[30], max_merged_segment[5gb], segments_per_tier[10.0], reclaim_deletes_weight[2.0], async_merge[true]
[2012-12-12 18:29:44,136][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i11][25] using [concurrent] merge scheduler with max_thread_count[3]
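
(For anyone who wants to reproduce this, a minimal sketch of how that DEBUG logging can be enabled in config/logging.yml, assuming the stock log4j-based logging config; the logger name matches the categories in the lines above:)

# config/logging.yml (sketch) - raise merge logging to DEBUG
logger:
  # covers both index.merge.policy and index.merge.scheduler
  index.merge: DEBUG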

I'm using the default merge policy on each node; each node has 16 cores.
num_replicas is 1.
Each node has around 250 shards, including primary and replica shards.
ES is 0.19.3.

This morning I encountered a node drop, and according to the log, the node that dropped was doing many merges at the time. The time taken by each merge seemed to be increasing around when the node dropped.
It started with relatively short times like this:

[2012-12-13 07:43:07,235][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i13][27] merge [_72z9] done, took [54.8s]
(the size of this shard is 4.2GB)

Then the time increased to minutes (7:50 is the time when this node dropped from the cluster):

[2012-12-13 07:50:30,396][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i15][3] merge [_16sc3] done, took [4.8m]
(the size of this shard is 5.8GB)

It kept increasing:

[2012-12-13 07:57:17,056][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i531][22] merge [_6293] done, took [11m]
(the size of this shard is 3.6GB)

I'm not sure whether this is related to my problem (the node drop), but my gut feeling is that I might need some tuning of the merge policy.

Any hints or guidance would be very much appreciated.
Thank you in advance.

Hello,

You can reduce the number of concurrent merges by lowering
index.merge.policy.max_merge_at_once and
index.merge.policy.max_merge_at_once_explicit. Another thing that might
help is to lower the index.merge.scheduler.max_thread_count, especially
since the default is based on the number of processors. And you have many
of those :)
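
For example, a minimal elasticsearch.yml sketch (the values are just illustrative starting points to experiment with, not recommendations; the same settings can also be set per index at creation time):

# elasticsearch.yml (sketch) - merge fewer segments at once, with fewer threads
index.merge.policy.max_merge_at_once: 5
index.merge.policy.max_merge_at_once_explicit: 15
index.merge.scheduler.max_thread_count: 1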

Documentation for all of the above settings can be found in the index merge module documentation.

250 shards per node is a lot. Can you bring that number down? How many
indices do you have?

Another interesting piece of information would be what's overloaded when a
node drops (CPU, IO, etc.). Is there anything interesting in the logs?

If it's CPU, lowering the number of concurrent merges and threads should
help. If it's IO, you might additionally look at store level throttling,
especially for the type "merge".
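
For illustration, a node-level sketch of those throttle settings (note: store-level throttling landed in a release after 0.19.3, so this may require an upgrade; the 20mb figure is only a starting point to tune against your disks):

# elasticsearch.yml (sketch) - throttle merge I/O only
indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 20mb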

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Our experience has been that when a node drops in an elastic cluster
something pretty basic is wrong.
For example, you have too little memory, too little cpu, not enough disc,
not enough iops to support your workload.

Radu courteously answered your original query about the segment merge
policy, but I fear that is not your actual problem and will not really help
you.

If you have 250 shards then you are putting tremendous pressure on the I/O
system whenever you are indexing. Lucene's architecture is carefully
designed to limit the amount of I/O on a per index basis. But by having
250 shards you are abusing this architecture, and this architecture is not
working for you, I suspect. Tweaking those parameters will probably just
change the time at which a node drops. Have you tried answering queries
while indexing? This will put more pressure on your cluster.

Of course, the community cannot help you in any real way without knowing
real basics like:

What are you trying to accomplish?
How many items are you indexing and what is their basic structure?
How many nodes do you have in a cluster, how much memory, how much disc?
Why did you choose 250 shards? Our experience has been that the number of
shards should be somewhere around the number of nodes for each index plus a
factor of ~two (but not 200).

Hope this helps,
Randy


Thank you Radu.
I did check the merge policy document.
My understanding by reading it is:

  • max_merge_at_once: smaller value = less load while merging, but a longer merge period
  • max_merged_segment: smaller value = less load while merging, but search may be slower
  • segments_per_tier: smaller value = more load while merging, but faster search
  • max_thread_count: the default is 3 for my 16-core machine = max(1, min(3, 16/2))

It seems to me they all have pros and cons, which is why I asked for expert guidance.

As for the number of shards per node, what would be the ideal number? 2, 5, 10, 20, 50, or 100?

At the beginning I wanted to have more than a couple of million indices, because our documents can be well partitioned into that many small groups, and searches are done within each group.
But that did not sound realistic, so I ended up with fewer than 100 indices, combining a couple of dozen thousand of these groups into one index. To get search performance out of the independence of the groups, I use routing. To make the routing more effective, I wanted to have as many shards as possible. With some research, I came up with a number: 32 shards per index.
That is the history of why I have about 250 shards per node now.

Reading your feedback, I guess I made a bad decision and should have gone with a smaller number of shards.
But since we can't change the number of shards per index after the fact, I want to work with what I have.

Our usage is 50% get, 45% index, and 5% search.
Overall traffic varies, but it is approximately 50-60 gets/sec and 40-50 indexing requests/sec.
We have about 20 data nodes for about 5000 shards.

I experimentally changed the JVM heap size from 24GB to 16GB (the machine has 32GB) and the problem started to happen more frequently, so I changed it back to 24GB.
Perhaps I need more memory and/or more nodes, but if I can work around the problem by tweaking some parameters, that would be great in the short term.

Thank you for the input, Randall.
I replied to Radu's input above, and there I described how we ended up with about 250 shards per node.
You said:

Our experience has been that the number of shards should be somewhere around the number of nodes for each index plus a factor of ~two (but not 200).
Does that mean, in our case ~80 indices with 32 shards and 1 replica, I should have 64 nodes?

Well, I am still not sure what the use case is that drives you to think you
need so many indices. What I have done in the past is to simply have a
doctype field which distinguishes the different types of things that are
being indexed and put all items into the same index. Lucene handles this
really well--it efficiently restricts itself to the subset of the indexing
universe specified by whatever doctype has been specified as part of the
query. This has the advantage that you can then write queries over multiple
doctypes that would have required merging from separate indices. Although
elastic makes this easier than a lucene-only system, I can't help thinking
that you would be better served with the one-index approach.

So yes, if you really need 80 indices I would try to have 32-64 nodes each
hosting a small number of shards. There is no reason that the whole cluster
needs to be involved in every query, right?

If you have found that you need 24gb of heap, that implies that a host
probably needs at least 48gb of memory, by the rule of thumb that no more
than half of a node's memory should be devoted to the elastic jvm heap. I
have also found this to be true: we are currently at 19gb for the jvm out of
54gb for the node.

But, again, by going to a few shards per index you can probably drastically
reduce the heap size. That is what I think you should try to do.


(I'm a colleague of arta...)

RKM,

The reason behind the number of indices was to logically "tie" together the
backends we were sourcing our indexing from...at this point we can't afford
to restructure our cluster and reindex all of our data.

Looking at heap usage, although we have 24GB of heap allocated, committed
usage seems to hover around 18GB. We also have a RAID flash cache so any
writes that get sent to disk are written to the cache first, so disk I/O
hasn't really been much above 50-60%, even during our heaviest indexing...

When looking at a graph of our concurrent merges for the past month, the
gradual increase (then sudden spike) of merge activity coincides with our
node disconnects, so that's why we've been focused on trying to prevent the
merge race condition from recurring (we posted the merge activity graph on
Imgur).

Does this provide any further clarity as to what may be happening here?


Still not sure what each node looks like? Cores, memory, disc?

Can you succinctly summarize for the group what your cluster looks like?

I understand that for logical reasons you have many indices, but why does
each index have 250 shards?

You saw that merging was happening around the time of the node drop--but
can you tell us why it dropped exactly? Did it fail to respond to a
heartbeat? Did the wrapper kill elasticsearch? There might be more than one
problem here.

Thanks,
Randy


Sorry, just to clarify...each index has only 32 shards with one replica,
but we have 250 shards per host.

Cluster in a nutshell:
Total combined index size: 23.4 TB

- 19 data nodes

  • 79 indices
  • 32 shards + 1 replica per index (79 × 32 × 2 = 5056 total shards in the cluster)
  • 5056 shards / 19 data nodes ≈ 266 shards/node (we allocate shards evenly
    and randomly across the cluster, so any index will have shards spanning
    most of the hosts in the cluster)

For each node:

16 cores (16xIntel(R) Xeon(R) CPU L5520 @ 2.27GHz)
32GB memory (8X 4096 MB/1333 MHz/DIMM/DDR3),

Local OS has 2x250GBx7200RPM in a mirrored RAID.

Separate RAID-10 for ES index data - Seagate 15K 450GB, total logical drive
size 2.7TB

Leading up to the node drop, we did receive ping responses, but since they
were past the timeout threshold we have set (120 seconds), they were
disregarded and the node got removed due to a failed heartbeat... This was
happening for a couple of hours leading up to the node dropping, so what
likely happened is that a couple of ping requests would time out, but one
would get through before it timed out:

[2012-12-10 07:14:52,411][WARN ][transport ] [Farallah]
Received response for a request that has timed out, sent [267929ms] ago,
timed out [147928ms] ago, action [discovery/zen/fd/ping], node [[Mark
Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id
[56024100]

[2012-12-10 11:45:44,488][WARN ][transport ] [Farallah]
Received response for a request that has timed out, sent [193187ms] ago,
timed out [73186ms] ago, action [discovery/zen/fd/ping], node [[Mark
Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id
[57862298]

[2012-12-10 11:51:45,528][INFO ][cluster.service ] [Farallah]
removed {[Mark Gervaisnight
shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]],}, reason:
zen-disco-node_failed([Mark Gervaisnight
shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]), reason failed to
ping, tried [3] times, each with maximum [2m] timeout
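
(As an aside, a sketch of the zen fault-detection settings that appear to govern what that last log line describes; the 120s timeout and 3 retries mirror the values above, the 1s interval is the usual default, and the exact names should be double-checked against the 0.19.x discovery docs:)

# elasticsearch.yml (sketch) - node fault detection
discovery.zen.fd.ping_interval: 1s    # how often nodes are pinged (assumed default)
discovery.zen.fd.ping_timeout: 120s   # matches the [2m] timeout in the log
discovery.zen.fd.ping_retries: 3      # matches "tried [3] times" in the log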


Hi,

I didn't carefully read this thread, but that seems like a high number of
shards/server. Does a single query hit all of them? If so, that's a
biiiig red flag.

How did you come up with the "32 shards" number? Why not 22? Or 19?

A quick way of testing whether you have too many shards and if that's the
source of the problem is to create a test cluster somewhere, take something
like a Wikipedia dump, index it into an index with a high number of shards
and do a performance test while observing various performance metrics (see
my signature). Then drop the index and create a new one with half the
shards and repeat. This will give you the feel for how much impact a high
number of shards has on perf.

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html


I work with Art and Ethan. A little background: we didn't target a specific
number of shards per server; instead we divided our user base into
partitions, and broke each partition up into a fixed number of shards, with
the built-in assumption that the indexes will get much, much bigger in the
future.

No, no single query hits every shard.

Just to restate some of the problem, things perform fine for a few days
after the cluster is started up. Then we notice the number of current
merges (as reported by ES) starts to grow until finally nodes start dropping
out and everything falls apart. We are not sure whether the current merges
metric is the cause of the problem or an effect.

The question we are wrestling with is, what changes between the first 2-3
days, when the number of merges seems relatively stable and the cluster is
performing great, and the next several days, when the number of merges
jumps up and everything falls apart.

During this time, heap usage in the JVM looks good and disk I/O looks fine -
the utilization percentage reported by iostat is under 50%. CPU seems to
increase with the number of active merges, but doesn't seem to max out.

Thoughts? We're actually in a period right now in which the number of
merges has started to climb, but no node has dropped out of the cluster
(yet). The pattern is consistent enough that we can look at the merge
activity graphs and anticipate when the next failure will occur.


This would seem to be relevant:

Lucene merges: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Elastic is simply building on top of lucene. You can "tweak" the merge
factors to get good results, but you will need to understand the
extensive tradeoffs involved. However, it's a slight investment compared to
what you have done so far.

Still, looking at your system from a high level, I believe that merge
factors will be a band-aid and the only real relief is to re-architect such
that you have a "reasonable" number of shards per node. For example, if you
are really needing 24gb of jvm heap then you probably need (at least) 48gb
of memory per node. Etc.


Thanks for the response. We'll check out that link.

Can you elaborate on where you think the bottleneck is that's leading to
this problem?

Under normal circumstances (the first couple days after the cluster comes
up), there don't seem to be any obvious resource constraints arising from
the large number of shards: CPU utilization is very reasonable (perhaps
around 30%), and I/O utilization seems reasonable, as well (under 40%). GC
logging seems to indicate that we're not having issues on that front.
Everything seems fine and dandy for a couple days.

Over time, however, something seems to degrade: the number of merges the
system completes decreases dramatically, and the number of slow merges
increases (based on the debug logs ES produces). This is an interesting
period, because the cluster outwardly seems to be healthy, but we can tell
from the metrics we're collecting that things are starting to go downhill. CPU
and disk I/O still seem fine, but for some reason there seems to be a
problem with merges.

Eventually - a few days after things start to slow down - the number of
threads in use skyrockets, the heap gets completely consumed, and nodes
start dropping out of the cluster. This is the end stage of the disease:
we've hit a tipping point and are dead.

We have some numbers (and maybe graphs) we might be able to share that
illustrate what's happening. To solve the problem, we'd like to understand
what resource we're low on before we jump into re-architecting the cluster,
lest we run into a similar problem again.

In the meantime, we are going to try adding another 32 GB of RAM to one of
our nodes while leaving the heap size unchanged to see if it changes the
pattern. I'm a little pessimistic on this front because I/O doesn't seem
to be a huge bottleneck right now (based on the output of iostat), but it
does seem possible that just fractionally improving merge performance
through additional buffer space could help if we are very slowly falling
behind.


Sorry for the double-post, and apologies in advance if posting attachments
to the group is verboten.

Here are graphs that show two values for the last week:

  • merges-current.jpg shows the number of current merges as reported by ES.
    Note that this number stayed low after midday Friday (when we "fixed"
    things the last time) through sometime Sunday, when things started to go
    downhill.

  • merges-total.jpg shows the rate of merges reported by ES, calculated by
    differencing the total-merges counter at regular time intervals.

Lining these two graphs up, it looks like things are good Friday and
Saturday: we're merging quite a bit, but not that many are active at any
given time, which seems to indicate things are going fast.

Starting on Sunday, the number of concurrent merges starts climbing, and
the total number of merges completed drops way off. Our conclusion is that
merges have started taking longer for some reason, which leads to more
running at the same time.

You can disregard the changes at the end of the day Wednesday - that's when
we started fixing things by restarting nodes. If the previous pattern
holds, things will be OK again for a couple days.

We also have things set up with debug logging so that ES prints a message
like this when a merge completes and takes more than 20 seconds:

[2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler ] [Matsu'o
Tsurayaba] [i26][22] merge [_6jeg] done, took [1.3m]

Picking a node at random, here are the numbers of slow merges on Saturday
(healthy), Monday (unhealthy, but still working) and Wednesday (going down
the drain):

Saturday: 7
Monday: 15
Wednesday: 22

Based on this information, the number of slow merges seems to be
increasing, while the total number of merges completed seems to be
decreasing.

Still, I/O and CPU seem reasonably healthy:

avg-cpu: %user %nice %system %iowait %steal %idle
23.84 0.00 1.90 1.19 0.00 73.07

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.67 0.00 1.67 0.00 0.01 11.20 0.00 0.60 0.20 0.03
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.67 0.00 1.67 0.00 0.01 11.20 0.00 0.60 0.20 0.03
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 1993.00 19.67 1207.67 0.15 12.47 21.07 4.40 3.18 0.19 22.93

On Wednesday, December 19, 2012 11:20:14 PM UTC-8, Jim Wong wrote:

Thanks for the response. We'll check out that link.

Can you elaborate on where you think the bottleneck is that's leading to
this problem?

Under normal circumstances (the first couple days after the cluster comes
up), there don't seem to be any obvious resource constraints arising from
the large number of shards: CPU utilization is very reasonable (perhaps
around 30%), and I/O utilization seems reasonable, as well (under 40%). GC
logging seems to indicate that we're not having issues on that front.
Everything seems fine and dandy for a couple days.

Over time, however, something seems to degrade: the number of merges the
system completes decreases dramatically, and the number of slow merges
increases (based on the debug logs ES produces). This is an interesting
period, because the cluster outwardly seems to be healthy, but we can tell
from the metrics we're collecting that things are starting to go downhill. CPU
and disk I/O still seem fine, but for some reason there seems to be a
problem with merges.

Eventually - a few days after things start to slow down - the number of
threads in use skyrockets, the heap gets completely consumed, and nodes
start dropping out of the cluster. This is the end stage of the disease:
we've hit a tipping point and are dead.

We have some numbers (and maybe graphs) we might be able to share that
illustrate what's happening. To solve the problem, we'd like to understand
what resource we're low on before we jump into re-architecting the cluster,
lest we run into a similar problem again.

In the meantime, we are going to try adding another 32 GB of RAM to one of
our nodes while leaving the heap size unchanged to see if it changes the
pattern. I'm a little pessimistic on this front because I/O doesn't seem
to be a huge bottleneck right now (based on the output of iostat), but it
does seem possible that just fractionally improving merge performance
through additional buffer space could help if we are very slowly falling
behind.

On Wednesday, December 19, 2012 3:55:17 PM UTC-8, RKM wrote:

This would seem to be relevant:

Lucene merges: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Elasticsearch is simply building on top of Lucene. You can "tweak" the merge
factors to get good results, but you will need to understand the extensive
tradeoffs involved. However, it's a slight investment compared to what you
have done so far.

Still, looking at your system from a high level, I believe that merge
factors will be a band-aid and the only real relief is to re-architect such
that you have a "reasonable" number of shards per node. For example, if you
really need 24GB of JVM heap, then you probably need (at least) 48GB of
memory per node, and so on.

On Wed, Dec 19, 2012 at 8:30 AM, Jim Wong jim...@alumni.rice.edu wrote:

I work with Art and Ethan. A little background: we didn't target a
specific number of shards per server; instead we divided our user base into
partitions, and broke each partition up into a fixed number of shards, with
the built-in assumption that the indexes will get much, much bigger in the
future.

No, no single query hits every shard.

Just to restate some of the problem, things perform fine for a few days
after the cluster is started up. Then, we notice the number of current
merges (as reported by ES) start to grow until finally nodes start dropping
out and everything falls apart. We are not sure whether the current merges
metric is the cause of the problem or an effect.

The question we are wrestling with is: what changes between the first
2-3 days, when the number of merges seems relatively stable and the cluster
is performing great, and the next several days, when the number of merges
jumps up and everything falls apart?

During this time, heap usage in the JVM looks good and disk I/O looks
fine - the utilization percentage reported by iostat is under 50%. CPU
seems to increase with the number of active merges, but doesn't seem to max
out.

Thoughts? We're actually in a period right now in which the number of
merges has started to climb, but no one has dropped out of the cluster
(yet). The pattern is consistent enough that we can look at the merge
activity graphs and anticipate when the next failure will occur.

On Tuesday, December 18, 2012 9:24:11 PM UTC-8, Otis Gospodnetic wrote:

Hi,

I didn't carefully read this thread, but that seems like a high number
of shards/server. Does a single query hit all of them? If so, that's a
biiiig red flag.

How did you come up with "32 shards" number? Why not 22? Or 19?

A quick way of testing whether you have too many shards and if that's
the source of the problem is to create a test cluster somewhere, take
something like a Wikipedia dump, index it into an index with a high number
of shards and do a performance test while observing various performance
metrics (see my signature). Then drop the index and create a new one with
half the shards and repeat. This will give you a feel for how much
impact a high number of shards has on perf.
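
As a concrete starting point for that experiment, a sketch like the following
creates the two test indices to compare. The host and index names are
placeholders, and number_of_shards can only be set at index creation time.

import json
import urllib.request

ES = "http://localhost:9200"  # placeholder test-cluster address

def create_index(name, shards, replicas=1):
    """Create an index with an explicit shard count (fixed at creation time)."""
    body = json.dumps({
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }).encode("utf-8")
    req = urllib.request.Request(ES + "/" + name, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        print(name, resp.status)

# One index with many shards, one with half as many, for the comparison run.
create_index("wiki_test_32", 32)
create_index("wiki_test_16", 16)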

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Tuesday, December 18, 2012 6:08:34 PM UTC-5, Ethan Shrago wrote:

Sorry, just to clarify...each index has only 32 shards with one
replica, but we have 250 shards per host.

Cluster in a nutshell:
Total combined index size: 23.4 TB

  • 19 data nodes

  • 79 indices
  • 32 primary shards + 1 replica of each, per index = 5056 total shards in the cluster
  • 5056 shards / 19 data nodes = approx. 266 shards/node (we allocate
    shards evenly and randomly across the cluster, so any index will have
    shards that span most of the hosts in the cluster)

For each node:

16 cores (16xIntel(R) Xeon(R) CPU L5520 @ 2.27GHz)
32GB memory (8X 4096 MB/1333 MHz/DIMM/DDR3),

Local OS has 2x250GBx7200RPM in a mirrored RAID.

Separate RAID-10 for ES index data - Seagate 15K 450GB, total logical
drive size 2.7TB

Leading up to the node drop, we did receive ping responses, but since
they were past the timeout threshold we have set (120 seconds), they were
disregarded and the node got removed due to a failed heartbeat. This was
happening for a couple of hours leading up to the node dropping, so likely
what happened is that a couple of ping requests would time out, but one
would get through before it timed out:

[2012-12-10 07:14:52,411][WARN ][transport ] [Farallah]
Received response for a request that has timed out, sent [267929ms] ago,
timed out [147928ms] ago, action [discovery/zen/fd/ping], node [[Mark
Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]],
id [56024100]

[2012-12-10 11:45:44,488][WARN ][transport ] [Farallah]
Received response for a request that has timed out, sent [193187ms] ago,
timed out [73186ms] ago, action [discovery/zen/fd/ping], node [[Mark
Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]],
id [57862298]

[2012-12-10 11:51:45,528][INFO ][cluster.service ] [Farallah]
removed {[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]],},
reason: zen-disco-node_failed([Mark Gervaisnight
shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]), reason
failed to ping, tried [3] times, each with maximum [2m] timeout

On Tuesday, December 18, 2012 2:23:16 PM UTC-8, RKM wrote:

Still not sure what each node looks like: cores, memory, disk?

Can you succinctly summarize for the group what your cluster looks
like?

I understand that for logical reasons you have many indices, but why
does each index have 250 shards?

You saw that merging was happening around the time of the node
drop--but can you tell us why it dropped exactly? Did it fail to respond
to a heartbeat? Did the wrapper kill elasticsearch? There might be more
than one problem here.

Thanks,
Randy


After all that, I forgot the attachments.


Hi Jim,
Nice graphs. It seems what we are missing, though, is the indexing volume.
Presumably, if you aren't adding to the index, merges would not be happening
at all. Is the indexing more or less constant over the time period shown?
Is it constant across the different indices, as well?

If it is constant across indices, that would seem to indicate that your
bigger merges are simply taking longer. (Again, look at the McCandless blog
for why this is.)

Since you have so many indices, have you tried putting them all on different
merge schedules? It seems like that would be trivial by one or more of: a)
setting a different merge policy; b) setting different
index.merge.policy.segments_per_tier values; c) optimizing segments at
various times for various indices. I've set segments_per_tier anywhere from
3 to 50 depending on whether I was trying to get the best query or indexing
performance, so experiment. My current working theory is that your cluster
dies because 79 indices x 32 shards are all trying to merge around the same time.
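
A rough sketch of (b), assuming your version lets these merge-policy settings
be updated dynamically through the index update-settings API (otherwise they
can be supplied when the index is created). The host is a placeholder and the
index names are just the ones that appear in the logs above.

import json
import urllib.request

ES = "http://localhost:9200"  # placeholder

def set_segments_per_tier(index, value):
    """Update one index's merge policy via PUT /{index}/_settings."""
    body = json.dumps({"index.merge.policy.segments_per_tier": value}).encode("utf-8")
    req = urllib.request.Request(ES + "/" + index + "/_settings",
                                 data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        print(index, "-> segments_per_tier", value, resp.status)

# Stagger the tiers so the indices do not all hit their merge thresholds at once.
for i, index_name in enumerate(["i11", "i13", "i15"]):
    set_segments_per_tier(index_name, 10 + 5 * i)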

So two concrete things to try:

  1. different merge schedules for all indices (should at least change that
    graph if this is the problem)
  2. fewer shards per index (try one or two, instead of 32).

I've been poking at memory because in my Elasticsearch system, and all of the
ones I've read about on this newsgroup, it seems to be a scarce resource:
the OS needs copious amounts to perform Lucene I/O, and Elasticsearch needs
lots to do its thing. I ended up with twice as much memory as we had
initially planned for in our cluster, for example.

Good luck,
Randy
