Sorry for the double-post, and apologies in advance if posting attachments
to the group is verboten.
Here are graphs that show two values for the last week:
- merges-current.jpg shows the number of current merges as reported by ES. Note that this number stayed low after midday Friday (when we "fixed" things the last time) through sometime Sunday, when things started to go downhill.
- merges-total.jpg shows the rate of merges reported by ES, calculated by differencing the cumulative total-merges metric at regular intervals.
Lining these two graphs up, it looks like things are good Friday and
Saturday: we're merging quite a bit, but not that many are active at any
given time, which seems to indicate things are going fast.
Starting on Sunday, the number of concurrent merges starts climbing, and
the total number of merges completed drops way off. Our conclusion is that
merges have started taking longer for some reason, which leads to more
running at the same time.
You can disregard the changes at the end of the day Wednesday - that's when
we started fixing things by restarting nodes. If the previous pattern
holds, things will be OK again for a couple days.
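For anyone who wants to reproduce the merges-total graph, the rate is just the difference of the cumulative counter between samples. A minimal sketch in Python, assuming the node stats endpoint on this ES version is /_cluster/nodes/stats and that the merge counters live under indices.merges in the response (both are assumptions; the path and field layout vary between releases):
  import json
  import time
  import urllib.request

  STATS_URL = "http://localhost:9200/_cluster/nodes/stats"  # assumed path for this ES version
  INTERVAL = 60  # seconds between samples

  def merge_counters():
      """Return {node_id: (current_merges, cumulative_total_merges)} from node stats."""
      with urllib.request.urlopen(STATS_URL) as resp:
          stats = json.loads(resp.read().decode("utf-8"))
      out = {}
      for node_id, node in stats["nodes"].items():
          merges = node["indices"]["merges"]  # assumed location of the merge stats
          out[node_id] = (merges["current"], merges["total"])
      return out

  previous = merge_counters()
  while True:
      time.sleep(INTERVAL)
      sample = merge_counters()
      for node_id, (current, total) in sample.items():
          prev_total = previous.get(node_id, (0, total))[1]
          rate = (total - prev_total) / float(INTERVAL)  # merges completed per second
          print("%s current=%d merges/sec=%.2f" % (node_id, current, rate))
      previous = sample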
We also have things set up with debug logging so that ES prints a message
like this when a merge completes and takes more than 20 seconds:
[2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler ] [Matsu'o Tsurayaba] [i26][22] merge [_6jeg] done, took [1.3m]
Picking a node at random, here are the counts of slow merges on Saturday (healthy), Monday (unhealthy, but still working), and Wednesday (going down the drain):
Saturday: 7
Monday: 15
Wednesday: 22
Based on this information, the number of slow merges seems to be
increasing, while the total number of merges completed seems to be
decreasing.
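For reference, the counts above come from nothing fancier than tallying the DEBUG lines. A rough sketch, with the regex and the 20-second threshold derived from the log format shown above:
  # usage: python count_slow_merges.py < elasticsearch.log
  import re
  import sys
  from collections import Counter

  # Matches lines like:
  # [2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler ] [...] merge [_6jeg] done, took [1.3m]
  LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) .*merge \[\S+\] done, took \[([\d.]+)(ms|s|m|h)\]")
  TO_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
  THRESHOLD = 20.0  # seconds; anything slower counts as a "slow" merge

  slow_per_day = Counter()
  for line in sys.stdin:
      m = LINE.match(line)
      if not m:
          continue
      day, value, unit = m.groups()
      if float(value) * TO_SECONDS[unit] > THRESHOLD:
          slow_per_day[day] += 1

  for day in sorted(slow_per_day):
      print(day, slow_per_day[day])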
Still, I/O and CPU seem reasonably healthy:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.84    0.00    1.90     1.19    0.00   73.07

Device:  rrqm/s   wrqm/s     r/s       w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.67    0.00      1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sda1       0.00     0.00    0.00      0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2       0.00     0.00    0.00      0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3       0.00     0.67    0.00      1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sdb        0.00     0.00    0.00      0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc        0.00  1993.00   19.67   1207.67     0.15    12.47    21.07     4.40    3.18   0.19  22.93
On Wednesday, December 19, 2012 11:20:14 PM UTC-8, Jim Wong wrote:
Thanks for the response. We'll check out that link.
Can you elaborate on where you think the bottleneck is that's leading to
this problem?
Under normal circumstances (the first couple days after the cluster comes
up), there don't seem to be any obvious resource constraints arising from
the large number of shards: CPU utilization is very reasonable (perhaps
around 30%), and I/O utilization seems reasonable, as well (under 40%). GC
logging seems to indicate that we're not having issues on that front.
Everything seems fine and dandy for a couple days.
Over time, however, something seems to degrade: the number of merges the
system completes decreases dramatically, and the number of slow merges
increases (based on the debug logs ES produces). This is an interesting
period, because the cluster outwardly seems to be healthy, but we can tell
from the metrics we're collecting that things are starting to go downhill. CPU
and disk I/O still seem fine, but for some reason there seems to be a
problem with merges.
Eventually - a few days after things start to slow down - the number of
threads in use skyrockets, the heap gets completely consumed, and nodes
start dropping out of the cluster. This is the end stage of the disease:
we've hit a tipping point and are dead.
We have some numbers (and maybe graphs) we might be able to share that
illustrate what's happening. To solve the problem, we'd like to understand
what resource we're low on before we jump into re-architecting the cluster,
lest we run into a similar problem again.
In the meantime, we are going to try adding another 32 GB of RAM to one of
our nodes while leaving the heap size unchanged to see if it changes the
pattern. I'm a little pessimistic on this front because I/O doesn't seem
to be a huge bottleneck right now (based on the output of iostat), but it
does seem possible that just fractionally improving merge performance
through additional buffer space could help if we are very slowly falling
behind.
On Wednesday, December 19, 2012 3:55:17 PM UTC-8, RKM wrote:
This would seem to be relevant:
Lucene merges: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Elasticsearch is simply building on top of Lucene. You can "tweak" the merge factor, but to get good results you will need to understand the extensive tradeoffs involved. However, it's a slight investment compared to what you have done so far.
Still, looking at your system from a high level, I believe that merge-factor tweaks will be a band-aid; the only real relief is to re-architect so that you have a "reasonable" number of shards per node. For example, if you really need 24 GB of JVM heap, then you probably need (at least) 48 GB of memory per node. Etc.
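For concreteness, with the tiered policy shown in the DEBUG output further down this thread, the closest equivalents of a "merge factor" are the index.merge.policy.* settings (segments_per_tier, max_merge_at_once, max_merged_segment). A hedged sketch of adjusting them, assuming this ES version accepts merge-policy settings via the index update-settings API; if it does not, they would have to be set in elasticsearch.yml or at index creation:
  import json
  import urllib.request

  # Values shown are the defaults reported in the logs below; a real "tweak"
  # would change them only after weighing the tradeoffs. "myindex" is a placeholder.
  settings = {
      "index.merge.policy.segments_per_tier": 10,
      "index.merge.policy.max_merge_at_once": 10,
      "index.merge.policy.max_merged_segment": "5gb",
  }

  req = urllib.request.Request(
      "http://localhost:9200/myindex/_settings",
      data=json.dumps(settings).encode("utf-8"),
      method="PUT",
  )
  with urllib.request.urlopen(req) as resp:
      print(resp.read().decode("utf-8"))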
On Wed, Dec 19, 2012 at 8:30 AM, Jim Wong jim...@alumni.rice.edu wrote:
I work with Art and Ethan. A little background: we didn't target a
specific number of shards per server; instead we divided our user base into
partitions, and broke each partition up into a fixed number of shards, with
the built-in assumption that the indexes will get much, much bigger in the
future.
No, no single query hits every shard.
Just to restate some of the problem, things perform fine for a few days
after the cluster is started up. Then, we notice the number of current
merges (as reported by ES) start to grow until finally nodes start dropping
out and everything falls apart. We are not sure whether the current merges
metric is the cause of the problem or an effect.
The question we are wrestling with is: what changes between the first 2-3 days, when the number of merges seems relatively stable and the cluster is performing great, and the next several days, when the number of merges jumps up and everything falls apart?
During this time, heap usage in the JVM looks good and disk I/O looks
fine - the utilization percentage reported by iostat is under 50%. CPU
seems to increase with the number of active merges, but doesn't seem to max
out.
Thoughts? We're actually in a period right now in which the number of
merges has started to climb, but no one has dropped out of the cluster
(yet). The pattern is consistent enough that we can look at the merge
activity graphs and anticipate when the next failure will occur.
On Tuesday, December 18, 2012 9:24:11 PM UTC-8, Otis Gospodnetic wrote:
Hi,
I didn't carefully read this thread, but that seems like a high number
of shards/server. Does a single query hit all of them? If so, that's a
biiiig red flag.
How did you come up with "32 shards" number? Why not 22? Or 19?
A quick way of testing whether you have too many shards and if that's
the source of the problem is to create a test cluster somewhere, take
something like a Wikipedia dump, index it into an index with a high number
of shards and do a performance test while observing various performance
metrics (see my signature). Then drop the index and create a new one with
half the shards and repeat. This will give you the feel for how much
impact a high number of shards has on perf.
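To make that experiment concrete, here is a minimal sketch in Python of creating the two test indices; everything in it (host, index names, shard counts) is a placeholder, and the corpus loading and benchmarking steps are left out:
  import json
  import urllib.request

  ES = "http://localhost:9200"

  def create_index(name, shards, replicas=1):
      """Create an index with an explicit shard/replica count."""
      body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
      req = urllib.request.Request(
          "%s/%s" % (ES, name),
          data=json.dumps(body).encode("utf-8"),
          method="PUT",
      )
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read().decode("utf-8"))

  # e.g. one index with a high shard count and one with half as many; index the
  # same Wikipedia dump into both, run the same queries, and compare latency and
  # merge metrics.
  print(create_index("wiki_shards_32", 32))
  print(create_index("wiki_shards_16", 16))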
Otis
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html
On Tuesday, December 18, 2012 6:08:34 PM UTC-5, Ethan Shrago wrote:
Sorry, just to clarify...each index has only 32 shards with one
replica, but we have 250 shards per host.
Cluster in a nutshell:
- Total combined index size: 23.4 TB
- 19 data nodes
- 79 indices
- 32 shards + 1 replica per index: 79 x 32 x 2 = 5056 total shards in cluster
- 5056 shards / 19 data nodes = approx. 266 shards/node (we allocate shards evenly and randomly across the cluster, so any index will have shards that span most of the hosts in the cluster)
For each node:
16 cores (16xIntel(R) Xeon(R) CPU L5520 @ 2.27GHz)
32GB memory (8X 4096 MB/1333 MHz/DIMM/DDR3),
Local OS has 2x250GBx7200RPM in a mirrored RAID.
Separate RAID-10 for ES index data - Seagate 15K 450GB, total logical
drive size 2.7TB
Leading up to the node drop, we did receive ping responses, but since they were past the timeout threshold we have set (120 seconds), they were disregarded and the node got removed due to a failed heartbeat. This was happening for a couple of hours leading up to the node dropping, so likely what happened is that a couple of ping requests would time out, but one would get through before it timed out:
[2012-12-10 07:14:52,411][WARN ][transport ] [Farallah] Received response for a request that has timed out, sent [267929ms] ago, timed out [147928ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [56024100]
[2012-12-10 11:45:44,488][WARN ][transport ] [Farallah] Received response for a request that has timed out, sent [193187ms] ago, timed out [73186ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [57862298]
[2012-12-10 11:51:45,528][INFO ][cluster.service ] [Farallah] removed {[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]],}, reason: zen-disco-node_failed([Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]), reason failed to ping, tried [3] times, each with maximum [2m] timeout
On Tuesday, December 18, 2012 2:23:16 PM UTC-8, RKM wrote:
Still not sure what each node looks like? Cores, memory, disc?
Can you succinctly summarize for the group what your cluster looks
like?
I understand that for logical reasons you have many indices, but why
does each index have 250 shards?
You saw that merging was happening around the time of the node
drop--but can you tell us why it dropped exactly? Did it fail to respond
to a heartbeat? Did the wrapper kill elasticsearch? There might be more
than one problem here.
Thanks,
Randy
On Thu, Dec 13, 2012 at 12:24 PM, arta arta...@sbcglobal.net wrote:
Hi,
I think I am seeing node performance degrade when the merge count ("merges": {"current": x, ...} in node stats) increases to somewhere between 6 and 10, and that results in the node dropping from the cluster.
I enabled index.merge DEBUG logging, and I see the following logs for each shard:
[2012-12-12 18:29:44,136][DEBUG][index.merge.policy ] [Scourge of the Underworld] [i11][25] using [tiered] merge policy with expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10], max_merge_at_once_explicit[30], max_merged_segment[5gb], segments_per_tier[10.0], reclaim_deletes_weight[2.0], async_merge[true]
[2012-12-12 18:29:44,136][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i11][25] using [concurrent] merge scheduler with max_thread_count[3]
I'm using the default merge policy; each node has 16 cores. num_replicas is 1.
Each node has around 250 shards, including primary and replica shards.
ES is 0.19.3.
This morning I encountered a node drop, and at that time the node that dropped was doing many merges, according to the log. The time taken by each merge seemed to be increasing when the node dropped.
It started with relatively short times like this:
[2012-12-13 07:43:07,235][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i13][27] merge [_72z9] done, took [54.8s]
(the size of this shard is 4.2GB)
Then the times increased to minutes (7:50 is when this node dropped from the cluster):
[2012-12-13 07:50:30,396][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i15][3] merge [_16sc3] done, took [4.8m]
(the size of this shard is 5.8GB)
It kept increasing:
[2012-12-13 07:57:17,056][DEBUG][index.merge.scheduler ] [Scourge of the Underworld] [i531][22] merge [_6293] done, took [11m]
(the size of this shard is 3.6GB)
I'm not sure this is related to my problem (the node drop), but my gut feeling is that I might need some tuning of the merge policy.
Any hints or guidance would be very much appreciated.
Thank you in advance.