ES instability


(Klavs Klavsen) #1

Hi guys,

I've got an ES cluster of two data nodes and one no-data node (serving the
Kibana website). It receives approx. 40 million log lines a day, and normally
has no issue with this.
If I stop reading in for a short time and then start again, the queue is
emptied about 50x faster than it is filled.

We've had several different issues, and have fixed up nprocs and tuned
elasticsearch.yml, which has helped. But ES (since 1.1.2, which might be a
coincidence) suddenly slows down immensely, which makes the queue fill up.
If I then stop everything and restart ES, then Logstash (LS), it usually
picks back up. Sometimes I have to do it several times.

The only thing that seems to increase in the Elasticsearch logs around the
time this happens is this message:
[2014-06-22 20:23:02,612][WARN ][transport ] [p-elasticlog02] Received response for a request that has timed out, sent [44943ms] ago, timed out [14943ms] ago, action [discovery/zen/fd/masterPing], node [[p-elasticlog03][JlyflI1AT6WJHh5fsk311w][p-elasticlog03.example.dk][inet[/10.223.156.18:9300]]{master=true}], id [23927]
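
The two durations in that warning can be read directly: the reply arrived about 45 s after the ping was sent, and the request had been declared timed out about 15 s before that. A minimal Python sketch of the arithmetic follows; the helper name `ping_timing` and the regex are illustrative only and not part of Elasticsearch.

```python
import re

# The warning as logged on p-elasticlog02 (shortened copy of the message above).
LOG_LINE = (
    "[2014-06-22 20:23:02,612][WARN ][transport ] [p-elasticlog02] "
    "Received response for a request that has timed out, sent [44943ms] ago, "
    "timed out [14943ms] ago, action [discovery/zen/fd/masterPing]"
)

# Hypothetical pattern: pulls the two durations out of the message text.
PATTERN = re.compile(r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago")


def ping_timing(line):
    """Return (round_trip_ms, implied_timeout_ms) for a timed-out request."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a 'timed out' transport warning")
    sent_ago_ms, timed_out_ago_ms = (int(g) for g in match.groups())
    # The request was declared timed out (sent - timed_out) ms after sending.
    return sent_ago_ms, sent_ago_ms - timed_out_ago_ms


round_trip_ms, timeout_ms = ping_timing(LOG_LINE)
print(f"master answered after {round_trip_ms} ms; timeout was {timeout_ms} ms")
```

A master ping that takes this long to answer would fit a long pause on one of the two nodes, although the log line alone does not say which node stalled or why.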

On the second node in the cluster (which seemed to be the cause) there were
GC messages, and I had to bring down the entire cluster to make it run
properly again (I could perhaps just have restarted the node that was
logging about GC).

I've set nprocs to 4096 and max open files to 65k.

ES is started with: /usr/bin/java -Xms41886M -Xmx41886M
-XX:MaxDirectMemorySize=41886M -Xss256k -Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdump.hprof -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.1.2.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/var/lib/elasticsearch
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.Elasticsearch

Any recommendations as to how I can try to fix this problem? It
happens a few times a week :(

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/70c87756-f9b8-4032-9906-9a520c28801e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

It sounds like you are running into GC problems, which is inevitable when
your cluster is at capacity. A few things:

You're running Java with a >32GB heap, which means your object pointers are
no longer compressed, and this can/will adversely impact GC.
What ES version are you on, which Java version and release, what are your
node specs, and how many indexes do you have and how large are they?
Make sure you're monitoring your cluster with plugins like ElasticHQ or
Marvel to give you insight into what is happening.
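
To make the heap-size point concrete, here is a small Python sketch that reads the `-Xmx` value from the startup flags quoted in the first post and compares it with a 32 GiB cutoff. The cutoff is a rule of thumb (the exact limit depends on the JVM build), and the helper name `xmx_bytes` is made up for this illustration.

```python
import re

# Assumption: compressed object pointers need a heap below roughly 32 GiB;
# the exact cutoff depends on the JVM build, so treat this as a rule of thumb.
COMPRESSED_OOPS_LIMIT_BYTES = 32 * 1024**3

UNIT_BYTES = {"k": 1024, "m": 1024**2, "g": 1024**3}


def xmx_bytes(java_args):
    """Return the -Xmx value in bytes, or None if the flag is absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_args)
    if match is None:
        return None
    number, unit = match.groups()
    # No unit suffix means the value is already in bytes.
    return int(number) * UNIT_BYTES.get(unit.lower(), 1)


# Memory-related flags copied from the startup command in the first post.
args = "-Xms41886M -Xmx41886M -XX:MaxDirectMemorySize=41886M -Xss256k"
heap = xmx_bytes(args)
print(f"heap = {heap / 1024**3:.1f} GiB")
print("above compressed-oops limit:", heap >= COMPRESSED_OOPS_LIMIT_BYTES)
```

With the flags from the first post this reports a heap of roughly 41 GiB, well above the cutoff.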

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 23 June 2014 04:44, Klavs Klavsen klavs@enableit.dk wrote:



(Klavs Klavsen) #3

https://lh3.googleusercontent.com/-K0dc3LcoH_s/U6gG_8Q7D2I/AAAAAAAAAE8/W5nKfBfAu24/s1600/ES-graphs.png

ES v1.1.2
openjdk 1.7.0_55
My nodes have 24 cores and 64GB memory, with an SSD set up as bcache in
writeback mode in front of 8 SATA disks in RAID 6.
I have almost no io-wait.

I'm pulling stats using diamond and have a dashboard set up in Grafana.
Attached is an image of when it last happened. The one thing I can see is
that it happens when ES is at ~37GB heap usage.

Any numbers in particular I should be looking for?
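
One number worth putting next to the ~37GB figure is how it relates to the configured heap and to the CMS trigger from the startup flags. The Python sketch below does that arithmetic. It assumes "~37GB" means GiB, and it only gives an upper bound for the CMS trigger, because `CMSInitiatingOccupancyFraction` applies to the old generation rather than to the whole heap.

```python
HEAP_MIB = 41886              # -Xmx41886M from the first post
CMS_FRACTION_PERCENT = 75     # -XX:CMSInitiatingOccupancyFraction=75
OBSERVED_GIB = 37             # "~37GB" from the graph; assumed to mean GiB


def percent_of_heap(used_gib, heap_mib):
    """Heap usage as a percentage of the configured maximum heap."""
    return used_gib * 1024 / heap_mib * 100


def cms_trigger_upper_bound_mib(heap_mib, fraction_percent):
    """Upper bound on the occupancy at which CMS starts a cycle.

    The fraction applies to the old generation only, which is smaller than
    the whole heap, so the real trigger point lies below this value.
    """
    return heap_mib * fraction_percent / 100


used = percent_of_heap(OBSERVED_GIB, HEAP_MIB)
trigger = cms_trigger_upper_bound_mib(HEAP_MIB, CMS_FRACTION_PERCENT)
print(f"observed usage: {used:.1f}% of max heap")
print(f"CMS should start below {trigger / 1024:.1f} GiB of old-gen occupancy")
```

If usage really sits around 90% of the maximum heap when the slowdown starts, the collector is already well past its trigger point; the GC log lines on the affected node would show whether collections are running back to back without freeing much.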



(Mark Walkom) #4

How much data do you have in the cluster, index count and total size?

Regards,
Mark Walkom

On 23 June 2014 20:53, Klavs Klavsen klavs@enableit.dk wrote:



(Klavs Klavsen) #5

https://lh4.googleusercontent.com/--R2XitJ2QVg/U6gILB_bwoI/AAAAAAAAAFI/OkdS3_Sf_xE/s1600/ES-graphs1.png
Wrong, that's not the image of when it last happened. I'm currently
working on fixing zoom issues in Grafana, so I can't give a detailed zoom yet.

Here's the last 24h. It happened at approx. 16:00 yesterday, and I fixed
it by restarting the cluster at 20:00 (I could probably have just restarted
the node that had written something about GC in its logs).



(Mark Walkom) #6

It's a good idea not to embed/attach images, as this list goes out to a lot
of people.
It'd be better to just link to them on an image hosting site :)

Regards,
Mark Walkom

On 23 June 2014 20:58, Klavs Klavsen klavs@enableit.dk wrote:



(Klavs Klavsen) #7

ES yml:
index.warmer.enabled: false
cluster.name: elasticsearch
node.name: "p-elasticlog02"
node.master: true
node.data: true
action.disable_delete_all_indices: true
indices.memory.index_buffer_size: 50%
indices.fielddata.cache.size: 30%
index.refresh_interval: 5s
index.index_concurrency: 16
threadpool.search.type: fixed
threadpool.search.size: 400
threadpool.search.queue_size: 900
threadpool.bulk.type: fixed
threadpool.bulk.size: 500
threadpool.bulk.queue_size: 900
threadpool.index.type: fixed
threadpool.index.size: 300
threadpool.index.queue_size: -1
path.data: /var/lib/elasticsearch/
bootstrap.mlockall: true
network.publish_host: 10.213.146.17
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [array of nodes]

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 15,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 985,
  "active_shards" : 1970,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

Searched 985 of 985 shards. 3790288072 hits.

It's used for Logstash, so it's not huge. On-disk size is 3.5TB, spread
across 4 nodes.
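
Broken down per data node, those figures look like this. The Python sketch only divides the numbers quoted above and assumes shards and data are spread evenly across the 4 data nodes, which the health output does not actually show.

```python
# Figures copied from the cluster health output and the text above.
ACTIVE_SHARDS = 1970
PRIMARY_SHARDS = 985
DATA_NODES = 4
TOTAL_DOCS = 3_790_288_072
ON_DISK_TB = 3.5

# Assumption: shards and data are balanced evenly over the data nodes.
shards_per_node = ACTIVE_SHARDS / DATA_NODES
tb_per_node = ON_DISK_TB / DATA_NODES
docs_per_primary = TOTAL_DOCS / PRIMARY_SHARDS

print(f"shards per data node:   {shards_per_node:.1f}")
print(f"on-disk TB per node:    {tb_per_node:.3f}")
print(f"docs per primary shard: {docs_per_primary:,.0f}")
```

That works out to nearly 500 shards and a little under 1TB per data node, each with roughly 4 million documents per primary shard.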



(Mark Walkom) #8

That's actually quite a lot of data. I'd recommend dropping/closing some
old indexes and/or adding another node; changing to Oracle Java will also
give you a bit more breathing room.
Any other changes you could make (e.g. disabling the bloom filter) would be
minor, with diminishing returns.
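
As a rough illustration of picking old indexes to drop or close, the Python sketch below selects daily indexes older than a retention window. It assumes Logstash-style daily names of the form `logstash-YYYY.MM.DD`; the thread does not show the real index names, and the sample names and the 30-day window are made up. The resulting list would still have to be closed or deleted through the cluster's own index APIs.

```python
from datetime import date, datetime, timedelta

# Assumption: daily indexes named "logstash-YYYY.MM.DD"; the thread does not
# show the actual index names.
PREFIX = "logstash-"


def indices_older_than(index_names, keep_days, today):
    """Return daily indexes dated more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in index_names:
        if not name.startswith(PREFIX):
            continue  # not a daily log index, leave it alone
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, skip rather than guess
        if day < cutoff:
            old.append(name)
    return sorted(old)


# Hypothetical index names, for illustration only.
names = [
    "logstash-2014.03.01",
    "logstash-2014.06.22",
    "logstash-2014.06.23",
    "kibana-int",
]
print(indices_older_than(names, keep_days=30, today=date(2014, 6, 23)))
```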

Regards,
Mark Walkom

On 23 June 2014 21:06, Klavs Klavsen klavs@enableit.dk wrote:


