Elasticsearch multi-node configuration issue with Marvel


(Rahul Nadella) #1

We have vertically scaled Elasticsearch in our cluster to 3 nodes (which holds up fine under performance testing), but when we add Marvel to it we start getting OutOfMemoryError exceptions. We have set marvel.agent.exporter.es.hosts in every node's configuration, but that has not resolved the issue. Any ideas on how to fix this (error log below)?

Elasticsearch Version - 1.5.2
Marvel Version - 1.3
Heap Size (per node) - 7GB
Open Files (per node) - 65535
Memlock (per node) - unlimited
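For what it's worth, "unable to create new native thread" usually means the OS thread limit, not the Java heap, is exhausted, so open-files and memlock settings won't help. A quick sketch for checking the relevant limits on each host (the elasticsearch user name below is an assumption; substitute whatever user runs the nodes):

```shell
# Per-user process/thread limit (nproc); this is the limit the
# "unable to create new native thread" OutOfMemoryError points at.
ulimit -u

# System-wide thread ceiling (Linux only)
cat /proc/sys/kernel/threads-max

# Threads currently in use by the user running Elasticsearch
# ("elasticsearch" is an assumed user name -- substitute your own)
ps -eLf | awk '$1 == "elasticsearch"' | wc -l
```

With three JVMs on one box, each sizing its thread pools for all 20 cores, the per-user limit can be hit well before the heap fills.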


(Rahul Nadella) #2

[2015-11-23 14:14:27,941][INFO ][cluster.service ] [-es3] detected_master [-es1][m5Ds1kUQQI2ZKelWJax8OA][SVCentral1][inet[/166.17.49.8:9300]]{master=true}, added {[-es1][m5Ds1kUQQI2ZKelWJax8OA][SVCentral1][inet[/166.17.49.8:9300]]{master=true},[-es2][nB6Y4WA2S0ynzGQ-C0Vclg][SVCentral1][inet[/166.17.49.8:9301]]{master=true},}, reason: zen-disco-receive(from master [[-es1][m5Ds1kUQQI2ZKelWJax8OA][SVCentral1][inet[/166.17.49.8:9300]]{master=true}])
[2015-11-23 14:14:27,955][INFO ][marvel.agent.exporter ] [-es3] hosts set to [localhost:9200]
[2015-11-23 14:14:28,043][INFO ][http ] [-es3] bound_address {inet[/0:0:0:0:0:0:0:0:9202]}, publish_address {inet[/166.17.49.8:9202]}
[2015-11-23 14:14:28,043][INFO ][node ] [-es3] started
[2015-11-23 14:23:39,551][WARN ][index.engine ] [-es2] [.marvel-2015.11.23][0] failed engine [out of memory]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:741)
at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-23 14:23:39,554][WARN ][index.shard ] [-es2] [.marvel-2015.11.23][0] Failed to perform scheduled engine optimize/merge
org.elasticsearch.index.engine.OptimizeFailedEngineException: [.marvel-2015.11.23][0] force merge failed
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:744)
at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:741)
... 4 more


(Mark Walkom) #3

What's the ES config look like?


(Rahul Nadella) #4

So basically I ran into a similar issue during performance testing (where we push 2 Gbps through Logstash over a 12-hour period, with a two-hour spike to 4 Gbps) while trying to search the cluster. It looks like the same problem, with es3 throwing an out-of-memory exception.

I have a 3-node (es1, es2, es3) system (vertically scaled) where all the nodes are data nodes and master-eligible. Each node has its own separate data, work, configuration, and logging directories/files. I'm wondering if we are missing a key setting that would help us avoid this exception. We are planning to move to doc_values and to set cluster.routing.allocation.same_shard.host: true for the next test.
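For reference, in Elasticsearch 1.x doc_values is enabled per field in the mapping rather than in elasticsearch.yml; a sketch of an index template for the domain_metadata-* indices seen in the logs might look like this (the template name and the domain field are made up for illustration):

```json
{
  "template": "domain_metadata-*",
  "mappings": {
    "_default_": {
      "properties": {
        "domain": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

Note that doc_values reduces fielddata heap pressure on aggregations and sorting, but it would not by itself fix a native-thread exhaustion error.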

Server specs: 2.6 GHz, 20 cores, 132 GB RAM, 9.9 TB data

es1 (same for es2, es3 except for the node name, config, logging, data, work)
cluster.name: sv
node.name: es1
node.master: true
node.data: true
index.number_of_shards: 1
index.number_of_replicas: 0
path.conf: /etc/elasticsearch
path.data: /data/elasticsearch/data-es1
path.work: /data/elasticsearch/work-es1
path.logs: /var/log/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
bootstrap.mlockall: false
http.port: 9200
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ['172.2.2.2:9300']
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.search.slowlog.threshold.index.warn: 10s
index.search.slowlog.threshold.index.info: 5s
index.search.slowlog.threshold.index.debug: 2s
index.search.slowlog.threshold.index.trace: 500ms
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s


(Rahul Nadella) #5

[2015-11-24 17:39:50,126][WARN ][monitor.jvm ] [-es3] [gc][young][83323][3329] duration [1.5s], collections [1]/[1.8s], total [1.5s]/[27.6s], memory [4.8gb]->[3gb]/[6.7gb], all_pools {[young] [1.8gb]->[1.9mb]/[1.8gb]}{[survivor] [139.8mb]->[96.3mb]/[232.9mb]}{[old] [2.8gb]->[2.9gb]/[4.7gb]}
[2015-11-24 19:22:35,102][INFO ][monitor.jvm ] [-es3] [gc][young][89485][3840] duration [887ms], collections [1]/[1.8s], total [887ms]/[33.2s], memory [4.8gb]->[3.3gb]/[6.7gb], all_pools {[young] [1.4gb]->[9.9mb]/[1.8gb]}{[survivor] [117.5mb]->[118.6mb]/[232.9mb]}{[old] [3.2gb]->[3.2gb]/[4.7gb]}
[2015-11-24 19:29:32,372][WARN ][index.engine ] [-es3] [domain_metadata-2015-11-25][0] failed engine [out of memory]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:741)
at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-24 19:29:32,376][WARN ][index.shard ] [-es3] [domain_metadata-2015-11-25][0] Failed to perform scheduled engine optimize/merge
org.elasticsearch.index.engine.OptimizeFailedEngineException: [domain_metadata-2015-11-25][0] force merge failed
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:744)
at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:741)
... 4 more
[2015-11-24 19:29:32,376][WARN ][indices.cluster ] [-es3] [[domain_metadata-2015-11-25][0]] marking and sending shard failed due to [engine failure, reason [out of memory]]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:391)
at org.elasticsearch.index.merge.EnableMergeScheduler.merge(EnableMergeScheduler.java:50)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1985)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1979)
at org.elasticsearch.index.engine.InternalEngine.maybeMerge(InternalEngine.java:741)
at org.elasticsearch.index.shard.IndexShard$EngineMerger$1.run(IndexShard.java:1148)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-11-24 19:29:32,472][WARN ][indices.cluster ] [-es3] [[dnsmon-2015-11-24][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [dnsmon-2015-11-24][0]: Recovery failed from [-es1][xMkrv9zTQGyED8WFTWx5vA][SVCentral1][inet[/166.17.49.8:9300]]{master=true} into [-es3][c39LaO3ETVGh2wmxLhQ2yw][SVCentral1][inet[/166.17.49.8:9302]]{master=true}


(Rahul Nadella) #6

Setting the processors configuration per node (in elasticsearch.yml) fixed this issue.
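For anyone hitting the same thing: with several nodes on one physical host, each node detects all cores and sizes its thread pools for the whole machine, so the combined thread count can blow past the OS limit. Capping processors divides the cores up. A sketch, assuming the 20-core box above split across three nodes (the value is illustrative; tune it to your hardware):

```yaml
# elasticsearch.yml on each node: roughly cores / nodes-per-host.
# 7 is an illustrative value for a 20-core host running 3 nodes.
processors: 7
```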


(system) #7