Elasticsearch dies every other day

My cluster - running 1.1.2 with Oracle Java 1.7.0_55 - dies several days a
week.
One of the nodes gets "disconnected"..
This time one of them logged "observer timed out, notifying listener"
several times.
I then have to restart them all - this time I had several nodes that each
thought they were master.
They are physical machines on the same LAN.
The machines (4 of them) index between 70k and 130k documents (lines from
Logstash) per minute. I write into several different indexes (not just
logstash-$date).

After a queue has built up, they will easily index 260k to 290k/min until
the queue is emptied, so they seem to have no resource shortage, but they
somehow "get tired and die" - very often :(

Any ideas how I should proceed in debugging this issue? It seems to fit
that EVERY time a node dies, I have garbage collection log entries. This
time I had these on one node:
[2014-07-12 17:12:25,132][INFO ][monitor.jvm] [p-elasticlog02] [gc][young][194961][25004] duration [767ms], collections [1]/[1s], total [767ms]/[34.3m], memory [27.7gb]->[26.7gb]/[31.7gb], all_pools {[young] [1gb]->[5.3mb]/[1.1gb]}{[survivor] [149.7mb]->[124.1mb]/[149.7mb]}{[old] [26.5gb]->[26.6gb]/[30.4gb]}
[2014-07-12 17:12:44,929][INFO ][monitor.jvm] [p-elasticlog02] [gc][young][194980][25007] duration [804ms], collections [1]/[1.1s], total [804ms]/[34.3m], memory [27.8gb]->[26.9gb]/[31.7gb], all_pools {[young] [1gb]->[11.2mb]/[1.1gb]}{[survivor] [149.7mb]->[146.1mb]/[149.7mb]}{[old] [26.6gb]->[26.7gb]/[30.4gb]}
[2014-07-12 17:14:57,032][INFO ][monitor.jvm] [p-elasticlog02] [gc][young][195109][25035] duration [837ms], collections [1]/[1s], total [837ms]/[34.4m], memory [28.9gb]->[28.1gb]/[31.7gb], all_pools {[young] [1gb]->[141.1mb]/[1.1gb]}{[survivor] [149.7mb]->[145.7mb]/[149.7mb]}{[old] [27.7gb]->[27.8gb]/[30.4gb]}
[2014-07-12 17:16:17,016][INFO ][monitor.jvm] [p-elasticlog02] [gc][young][195187][25053] duration [756ms], collections [1]/[1.4s], total [756ms]/[34.5m], memory [29.5gb]->[28.7gb]/[31.7gb], all_pools {[young] [926.8mb]->[27.6mb]/[1.1gb]}{[survivor] [149.7mb]->[138.6mb]/[149.7mb]}{[old] [28.4gb]->[28.5gb]/[30.4gb]}
[2014-07-12 17:18:57,313][WARN ][monitor.jvm] [p-elasticlog02] [gc][young][195303][25075] duration [1.1s], collections [1]/[40.6s], total [1.1s]/[34.6m], memory [30.4gb]->[29.2gb]/[31.7gb], all_pools {[young] [1.1gb]->[12.3mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [29.1gb]->[29.2gb]/[30.4gb]}
[2014-07-12 17:18:57,314][WARN ][monitor.jvm] [p-elasticlog02] [gc][old][195303][53] duration [39.1s], collections [2]/[40.6s], total [39.1s]/[1.5m], memory [30.4gb]->[29.2gb]/[31.7gb], all_pools {[young] [1.1gb]->[12.3mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [29.1gb]->[29.2gb]/[30.4gb]}

I collect a lot of counters from Elasticsearch (using the elasticsearch
collector in Diamond, a Graphite collector written in Python), so I have
data on the ES nodes.

My config is this:
index.warmer.enabled: false
cluster.name: elasticsearch
node.name: "p-elasticlog02"
node.master: true
node.data: true
action.disable_delete_all_indices: true
indices.memory.index_buffer_size: 50%
indices.fielddata.cache.size: 30%
index.refresh_interval: 5s
index.index_concurrency: 16
threadpool.search.type: fixed
threadpool.search.size: 400
threadpool.search.queue_size: 900
threadpool.bulk.type: fixed
threadpool.bulk.size: 500
threadpool.bulk.queue_size: 900
threadpool.index.type: fixed
threadpool.index.size: 300
threadpool.index.queue_size: -1
path.data: /var/lib/elasticsearch/
bootstrap.mlockall: true
network.publish_host: $hostip
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["p-elasticlog01.example.idk",
"p-elasticlog02.example.idk", "p-elasticlog03.example.idk",
"p-elasticlog04.example.idk", "p-elasticlog05.example.idk"]

I have 24 cores in each machine (doing pretty much nothing), so I was
considering trying to switch to G1GC, for example, as I've read it should
be better in some respects. Any input?
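For what it's worth, a minimal sketch of what that switch could look like, assuming JVM options come from the stock bin/elasticsearch.in.sh shipped with the 1.x packages (file name and defaults may differ per install). The default ParNew/CMS flags there would have to be removed or commented out first, since the JVM rejects conflicting collector combinations:

# in bin/elasticsearch.in.sh - comment out the stock collector flags:
# JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
# JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
# JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
# JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
# ...and enable G1 instead:
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
JAVA_OPTS="$JAVA_OPTS -XX:MaxGCPauseMillis=200"   # pause-time target is a hint, not a guarantee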


What's your heap size, and how much data do you have in your indexes?
Take a look at settings like discovery.zen.minimum_master_nodes to reduce
the multi-master problem you see, although ideally you want an odd number
of master-eligible nodes for that (e.g. 3 or 5).
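For a cluster with five master-eligible nodes that works out to a quorum of (5 / 2) + 1 = 3, set in elasticsearch.yml on every node:

discovery.zen.minimum_master_nodes: 3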

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 13 July 2014 04:22, Klavs Klavsen klavs@enableit.dk wrote:


Maybe your GC pauses are stopping the node for longer than the default 30s
zen ping timeout. G1GC should produce shorter pauses, although playing
around with your heap size under CMS could also work. I'd guess you're
indexing hard enough that there's pressure pushing new objects into the old
gen prematurely, which then results in long pauses when it tries to clean up.

I'd probably just go with G1GC to reduce pause times and set the minimum
master nodes setting. If you want to have a look at what GC is doing, try
turning on PrintGCDetails and related flags and use a log analyzer to see
if that gives you any clues.
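As a rough sketch, with the 1.x startup scripts those flags can be appended to the JVM options (exactly where depends on how the service is launched - ES_JAVA_OPTS or the options in elasticsearch.in.sh), for example:

export ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"

The resulting gc.log can then be fed to any HotSpot GC log analyzer.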

On Sunday, July 13, 2014 12:22:46 AM UTC+1, Mark Walkom wrote:


You are certainly having a heap utilization issue: as utilization gets close to 100%, the GCs get aggressive.

Is your heap utilization staying close to the edge? Does it jump up at the end? What about the field data cache? What about hot threads?
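For reference, those numbers can be pulled straight from the node stats APIs on a 1.x cluster (assuming the usual localhost:9200):

curl -s 'localhost:9200/_nodes/stats/jvm?pretty'                         # heap used vs. max per node
curl -s 'localhost:9200/_nodes/stats/indices/fielddata?fields=*&pretty'  # field data usage per field
curl -s 'localhost:9200/_nodes/hot_threads'                              # busiest threads right now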

200+ per second is higher than needed indexing. You may want to try scaling back the index threads.
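For example, something closer to the 1.x defaults - which, as far as I recall, size the index and bulk pools at roughly one thread per core and the search pool at about three times that - should be plenty on a 24-core box:

threadpool.index.size: 24
threadpool.bulk.size: 24
threadpool.search.size: 72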

Otherwise, if it's not a huge spike, you may need more memory.


With this setting, you allow single-doc indexing to flood your heap, and
then you ask why your heap collapses. I'm not sure how the Logstash indexer
works, but you should review your custom settings.

Jörg

On Saturday, July 12, 2014 8:22:33 PM UTC+2, Klavs Klavsen wrote:

threadpool.index.queue_size: -1


Changing it to:
threadpool.index.queue_size: 900

Hope it helps :)


On Sunday 13 July 2014 01:22:46 UTC+2, Mark Walkom wrote:

What's your heap size, and how much data do you have in your indexes?

Data used = approx. 4.5TB
Heap size = 32700M (less than 32GB :))

Heap memory used usually runs between 24 and 28GB, but when this happens
I see heap usage go to the max heap size.

5,505,713,372 hits in 1322 shards.

Take a look at settings like discovery.zen.minimum_master_nodes to reduce
the multi-master problem you see, although ideally you want an odd number
of nodes for that (e.g. 3 or 5).

I have 4 data nodes and one non-data node (does it count as a fifth?)


Here's the heap usage graph for the last week:

I've marked where it crashed. I hope this can help give an idea of what
goes wrong and what I could try.


Is the non-data node a client node? Here we are counting master-eligible
nodes. Whether you have 4 or 5, I would go with a minimum of 3
master-eligible nodes. I'm not sure what your replica setup is, but only 2
nodes is probably not a healthy cluster.

With discovery.zen.minimum_master_nodes set to 3 you won't have a split brain.
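If it helps, on 1.x that setting can also be applied to the running cluster without a restart (again assuming localhost:9200), e.g.:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 3 }
}'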


Do you have thread counts, and do any of them correlate with the crash times? I'm guessing we'll find that index threads leap up.
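One quick way to watch that on a 1.x cluster is the cat API, which lists active, queued and rejected counts per node for the bulk, index and search pools:

curl -s 'localhost:9200/_cat/thread_pool?v'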


Looks like your cluster is at capacity.
Try closing/deleting some old indexes, adding more nodes or more RAM.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 14 July 2014 23:22, smonasco smonasco@gmail.com wrote:


define "at capacity" ?
I can easily give it more ram.. it has had 42GB -but that just made it jump
even more :slight_smile:

I changed index queue to 900 (instead of -1) and set two of the four nodes
to using g1gc - which has altered the GC very much (it collects very
frequently) - and those were the ones that died most often - and so far,
they have not died.

heap memory usage is ~28GB at max. still letting it run for a few days.
I simultaneously (I know that was stupid.. ) moved some content, that I
know they search over longer periods (7 days and longer) into it's own
index - so they do not search the entire logstash index.


Every system has a limit to what it can process or maintain, and you're
probably very near that limit based on your setup.
This is a fluid limit, as it depends a lot on what you do, but generally if
you aren't doing anything beyond what you'd consider "normal" and you are
getting GC problems and crashing nodes, you're likely to be at that limit.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 15 July 2014 22:32, Klavs Klavsen klavs@enableit.dk wrote:

define "at capacity" ?
I can easily give it more ram.. it has had 42GB -but that just made it
jump even more :slight_smile:

I changed index queue to 900 (instead of -1) and set two of the four nodes
to using g1gc - which has altered the GC very much (it collects very
frequently) - and those were the ones that died most often - and so far,
they have not died.

heap memory usage is ~28GB at max. still letting it run for a few days.
I simultaneously (I know that was stupid.. ) moved some content, that I
know they search over longer periods (7 days and longer) into it's own
index - so they do not search the entire logstash index.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/9f306df3-f9d7-44ec-8701-b63d5b3e51f7%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/9f306df3-f9d7-44ec-8701-b63d5b3e51f7%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAEM624Z_MS5aOMAiBUo_yBh7vnEYTVcuL-UNbB1xWrM-Rn38Ow%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

I updated the graph:
http://blog.klavsen.info/ES-graphs-update.png

I added an overview of how many threads were running, and it's apparent
that what peaked when it crashed (left side of the two graphs - two spikes
where it crashed) correlated with a peak in search threads.
Also, the change to G1GC for two of the nodes is very apparent in the
heap_mem_usage graph :)

It's been stable for two days now - nearing the record :) I did also move a
"culprit" that searched the main index over longer periods into its own
index, and changed threadpool.index.queue_size from -1 to 900.
The index queue size does not seem to be hit at all, so I'm not sure that
made a difference.

Thank you for your input, everyone.


Hey! Your graphs are really nice - that looks like Grafana. I was wondering
how you're piping data there? I used the ES Graphite plugin and found that
it flooded my Graphite with too much data.

Thanks

On Wednesday, July 16, 2014 9:26:46 AM UTC+2, Klavs Klavsen wrote:
