Out of heap error on machines with 18GB heap and 6GB index


(Justin Zhu) #1

We have a 3-node cluster. Each node has 30 GB of total memory, with 18 GB
allocated to Elasticsearch, and replicas are set to 2. Our largest index is 6 GB.

After running for a few days, the cluster goes down with Java out-of-memory
(heap space) errors. We currently run a multi-get aggregation that issues 40
requests against the same index to get unique counts for a list of document
types. It is quite unoptimized at the moment, as it touches the whole index.
The query is below.

{
  "size" : 0,
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : { }
      },
      "filter" : {
        "bool" : {
          "must" : {
            "term" : {
              "campaignId" : 1914
            }
          }
        }
      }
    }
  },
  "aggregations" : {
    "distinct_count" : {
      "cardinality" : {
        "field" : "email",
        "precision_threshold" : 40000
      }
    }
  }
}
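
For reference, here is a minimal Python sketch of how the 40 requests could be assembled into one newline-delimited multi-search body (a header line followed by a query line per request). It is only an illustration under assumptions: the helper names `build_count_query` and `build_msearch_body` are made up, the index name is a placeholder, and it assumes each request differs only in its `campaignId` value (1915 is a made-up second ID). It builds the request text only and does not talk to a cluster.

```python
import json


def build_count_query(campaign_id, precision_threshold=40000):
    """Return the search body shown above for a single campaign ID."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "bool": {"must": {"term": {"campaignId": campaign_id}}}
                },
            }
        },
        "aggregations": {
            "distinct_count": {
                "cardinality": {
                    "field": "email",
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def build_msearch_body(index, campaign_ids):
    """Build a multi-search body: one header line and one query line per ID."""
    lines = []
    for campaign_id in campaign_ids:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(build_count_query(campaign_id)))  # query line
    # The multi-search format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Placeholder index name and campaign IDs, for illustration only.
body = build_msearch_body("my-index", [1914, 1915])
```

Each query in the body is still executed separately on every shard of the index, so batching the requests this way changes how they are sent, not how much work each one does.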

With 18 GB of heap, I expected this not to fail. Is there a cache eviction
setting we need to configure?

Here's what we see in the logs:

[2014-05-13 18:17:36,653][WARN ][cluster.action.shard ]
[elasticsearch-i2-1] [prod-project-55][1] received shard failed for
[prod-project-55][1], node[5tMu5N29SX6-4moapYi9kg], [P], s[STARTED],
indexUUID [na], reason [master
[elasticsearch-i2-1][-RkMYiRzRjGRYyfmotvJ2A][ip-10-0-0-97.ec2.internal][inet[/10.0.0.97:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1} marked shard as started, but shard has not been
created, mark shard as failed]
[2014-05-13 18:17:36,653][WARN ][cluster.action.shard ]
[elasticsearch-i2-1] [prod-project-53][1] received shard failed for
[prod-project-53][1], node[5tMu5N29SX6-4moapYi9kg], [P], s[STARTED],
indexUUID [na], reason [master
[elasticsearch-i2-1][-RkMYiRzRjGRYyfmotvJ2A][ip-10-0-0-97.ec2.internal][inet[/10.0.0.97:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1} marked shard as started, but shard has not been
created, mark shard as failed]
[2014-05-13 18:17:36,654][WARN ][cluster.action.shard ]
[elasticsearch-i2-1] [prod-project-55][4] received shard failed for
[prod-project-55][4], node[5tMu5N29SX6-4moapYi9kg], [P], s[STARTED],
indexUUID [na], reason [master
[elasticsearch-i2-1][-RkMYiRzRjGRYyfmotvJ2A][ip-10-0-0-97.ec2.internal][inet[/10.0.0.97:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1} marked shard as started, but shard has not been
created, mark shard as failed]
[2014-05-13 18:18:10,599][DEBUG][action.admin.indices.create]
[elasticsearch-i2-1] [14851415c4714c98a447b688af2ead39] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException:
failed to process cluster event (acquire index lock) within 30s
at
org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.run(MetaDataCreateIndexService.java:141)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[2014-05-13 18:18:10,600][DEBUG][action.search.type ]
[elasticsearch-i2-1] [prod-project-52-5-1][0],
node[-RkMYiRzRjGRYyfmotvJ2A], [R], s[STARTED]: Failed to execute
[org.elasticsearch.action.search.SearchRequest@7d8077d2] lastShard [true]
org.elasticsearch.ElasticsearchException: Java heap space
at
org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:37)
at
org.elasticsearch.search.SearchService.createContext(SearchService.java:531)
at
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:480)
at
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
at
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:202)
at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:216)
at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4.run(TransportSearchTypeAction.java:296)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2014-05-13 18:18:16,694][WARN ][transport ]
[elasticsearch-i2-1] Received response for a request that has timed out,
sent [45926ms] ago, timed out [1024ms] ago, action
[/cluster/nodes/indices/shard/store/n], node
[[es3][5tMu5N29SX6-4moapYi9kg][ip-10-0-0-150.ec2.internal][inet[/10.0.0.150:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1}], id [5477680]
[2014-05-13 18:18:17,335][INFO ][cluster.service ]
[elasticsearch-i2-1] removed
{[elasticsearch-i2-2][fXiz-PJ5R0q92QSq3d_GeQ][ip-10-0-0-181.ec2.internal][inet[/10.0.0.181:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1},}, reason:
zen-disco-node_failed([elasticsearch-i2-2][fXiz-PJ5R0q92QSq3d_GeQ][ip-10-0-0-181.ec2.internal][inet[/10.0.0.181:9300]]{aws_availability_zone=us-east-1a,
max_local_storage_nodes=1}), reason transport disconnected (with verified
connect)

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/02f9fcfe-1fa5-41d0-b090-3107a6a65ff6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Justin Zhu) #2

Here's the GC log:

[2014-05-13 18:09:25,912][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404487][53746] duration [1s], collections
[1]/[1s], total [1s]/[33.7m], memory [10.4gb]->[10.6gb]/[17.7gb], all_pools
{[young] [48.9mb]->[12.9mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [10.4gb]->[10.6gb]/[17.5gb]}
[2014-05-13 18:09:26,949][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404488][53747] duration [1s], collections
[1]/[1s], total [1s]/[33.8m], memory [10.6gb]->[10.9gb]/[17.7gb], all_pools
{[young] [12.9mb]->[24.7mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [10.6gb]->[10.9gb]/[17.5gb]}
[2014-05-13 18:09:29,311][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404489][53749] duration [2.2s],
collections [2]/[2.3s], total [2.2s]/[33.8m], memory
[10.9gb]->[11.5gb]/[17.7gb], all_pools {[young]
[24.7mb]->[126.1mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [10.9gb]->[11.4gb]/[17.5gb]}
[2014-05-13 18:09:32,490][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404490][53751] duration [3.1s],
collections [2]/[3.1s], total [3.1s]/[33.8m], memory
[11.5gb]->[11.9gb]/[17.7gb], all_pools {[young]
[126.1mb]->[10mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[11.4gb]->[11.9gb]/[17.5gb]}
[2014-05-13 18:09:34,148][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404491][53752] duration [1.6s],
collections [1]/[1.6s], total [1.6s]/[33.9m], memory
[11.9gb]->[12.2gb]/[17.7gb], all_pools {[young]
[10mb]->[17.8mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[11.9gb]->[12.1gb]/[17.5gb]}
[2014-05-13 18:09:35,593][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404492][53753] duration [1.4s],
collections [1]/[1.4s], total [1.4s]/[33.9m], memory
[12.2gb]->[12.5gb]/[17.7gb], all_pools {[young]
[17.8mb]->[87.3mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[12.1gb]->[12.4gb]/[17.5gb]}
[2014-05-13 18:09:36,986][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404493][53754] duration [1.3s],
collections [1]/[1.3s], total [1.3s]/[33.9m], memory
[12.5gb]->[12.8gb]/[17.7gb], all_pools {[young]
[87.3mb]->[95.4mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[12.4gb]->[12.6gb]/[17.5gb]}
[2014-05-13 18:09:38,321][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404494][53755] duration [1.2s],
collections [1]/[1.3s], total [1.2s]/[33.9m], memory
[12.8gb]->[12.9gb]/[17.7gb], all_pools {[young]
[95.4mb]->[8.2mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[12.6gb]->[12.9gb]/[17.5gb]}
[2014-05-13 18:09:39,760][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404495][53756] duration [1.3s],
collections [1]/[1.4s], total [1.3s]/[34m], memory
[12.9gb]->[13.3gb]/[17.7gb], all_pools {[young]
[8.2mb]->[149.6mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[12.9gb]->[13.1gb]/[17.5gb]}
[2014-05-13 18:09:42,444][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404496][53758] duration [2.6s],
collections [2]/[2.6s], total [2.6s]/[34m], memory
[13.3gb]->[13.7gb]/[17.7gb], all_pools {[young]
[149.6mb]->[99.5mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [13.1gb]->[13.6gb]/[17.5gb]}
[2014-05-13 18:09:43,842][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404497][53759] duration [1.3s],
collections [1]/[1.3s], total [1.3s]/[34m], memory
[13.7gb]->[14.1gb]/[17.7gb], all_pools {[young]
[99.5mb]->[240.5mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [13.6gb]->[13.9gb]/[17.5gb]}
[2014-05-13 18:09:45,018][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404498][53760] duration [1.1s],
collections [1]/[1.1s], total [1.1s]/[34m], memory
[14.1gb]->[14.3gb]/[17.7gb], all_pools {[young]
[240.5mb]->[147.9mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [13.9gb]->[14.1gb]/[17.5gb]}
[2014-05-13 18:09:46,694][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404499][53762] duration [1.6s],
collections [2]/[1.6s], total [1.6s]/[34.1m], memory
[14.3gb]->[14.7gb]/[17.7gb], all_pools {[young]
[147.9mb]->[64.1mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [14.1gb]->[14.6gb]/[17.5gb]}
[2014-05-13 18:09:48,287][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404500][53764] duration [1.5s],
collections [2]/[1.5s], total [1.5s]/[34.1m], memory
[14.7gb]->[15.2gb]/[17.7gb], all_pools {[young]
[64.1mb]->[4.2mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[14.6gb]->[15.1gb]/[17.5gb]}
[2014-05-13 18:09:49,862][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404501][53766] duration [1.4s],
collections [2]/[1.5s], total [1.4s]/[34.1m], memory
[15.2gb]->[15.8gb]/[17.7gb], all_pools {[young]
[4.2mb]->[136.5mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old]
[15.1gb]->[15.7gb]/[17.5gb]}
[2014-05-13 18:09:51,395][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404502][53768] duration [1.4s],
collections [2]/[1.5s], total [1.4s]/[34.1m], memory
[15.8gb]->[16.3gb]/[17.7gb], all_pools {[young]
[136.5mb]->[129.7mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [15.7gb]->[16.2gb]/[17.5gb]}
[2014-05-13 18:09:58,923][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404504][24] duration [5.8s], collections
[1]/[6.1s], total [5.8s]/[15.6s], memory [16.9gb]->[13.8gb]/[17.7gb],
all_pools {[young] [149.7mb]->[59.8mb]/[266.2mb]}{[survivor]
[33.2mb]->[0b]/[33.2mb]}{[old] [16.7gb]->[13.7gb]/[17.5gb]}
[2014-05-13 18:10:06,487][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404506][25] duration [5.8s], collections
[1]/[6.5s], total [5.8s]/[21.5s], memory [16.1gb]->[17.6gb]/[17.7gb],
all_pools {[young] [69mb]->[215.2mb]/[266.2mb]}{[survivor]
[33.2mb]->[0b]/[33.2mb]}{[old] [16gb]->[17.4gb]/[17.5gb]}
[2014-05-13 18:10:17,047][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404508][27] duration [5.7s], collections
[1]/[5.7s], total [5.7s]/[32s], memory [17.6gb]->[17.7gb]/[17.7gb],
all_pools {[young] [120.9mb]->[266.2mb]/[266.2mb]}{[survivor]
[0b]->[26.2mb]/[33.2mb]}{[old] [17.5gb]->[17.5gb]/[17.5gb]}
[2014-05-13 18:14:10,337][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404511][107] duration [5.6s], collections
[1]/[6.3s], total [5.6s]/[4.4m], memory [16.4gb]->[15.2gb]/[17.7gb],
all_pools {[young] [146.9mb]->[23.6mb]/[266.2mb]}{[survivor]
[0b]->[0b]/[33.2mb]}{[old] [16.3gb]->[15.2gb]/[17.5gb]}
[2014-05-13 18:14:17,014][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404513][108] duration [5.1s], collections
[1]/[5.6s], total [5.1s]/[4.4m], memory [16.8gb]->[17.5gb]/[17.7gb],
all_pools {[young] [245.2mb]->[83.2mb]/[266.2mb]}{[survivor]
[33.2mb]->[0b]/[33.2mb]}{[old] [16.5gb]->[17.4gb]/[17.5gb]}
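
To make the trend in these entries easier to read, here is a rough Python sketch that pulls the old-generation pool figures (before, after, limit) out of `monitor.jvm` lines. It assumes the line layout matches the entries above, including the mail client's line wrapping, and the names `old_gen_usage` and `SAMPLE` are made up for this example. The sample contains two entries copied from the log above.

```python
import re

# Conversion factors to gigabytes for the size suffixes seen in the log.
UNIT_TO_GB = {"b": 1 / 1024**3, "kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}

SIZE = r"\[([\d.]+)(gb|mb|kb|b)\]"
# Match the collector kind, then the {[old] [before]->[after]/[limit]} pool.
OLD_POOL = re.compile(
    r"\[gc\]\[(young|old)\].*?\{\[old\] " + SIZE + "->" + SIZE + "/" + SIZE + r"\}"
)


def to_gb(value, unit):
    return float(value) * UNIT_TO_GB[unit]


def old_gen_usage(log_text):
    """Return (collector, before_gb, after_gb, limit_gb) for each GC entry."""
    # Undo the line wrapping so every entry sits on one logical line.
    flat = " ".join(log_text.split())
    rows = []
    for m in OLD_POOL.finditer(flat):
        rows.append((
            m.group(1),
            to_gb(m.group(2), m.group(3)),
            to_gb(m.group(4), m.group(5)),
            to_gb(m.group(6), m.group(7)),
        ))
    return rows


SAMPLE = """
[2014-05-13 18:09:25,912][WARN ][monitor.jvm ]
[elasticsearch-i2-1] [gc][young][404487][53746] duration [1s], collections
[1]/[1s], total [1s]/[33.7m], memory [10.4gb]->[10.6gb]/[17.7gb], all_pools
{[young] [48.9mb]->[12.9mb]/[266.2mb]}{[survivor]
[33.2mb]->[33.2mb]/[33.2mb]}{[old] [10.4gb]->[10.6gb]/[17.5gb]}
[2014-05-13 18:10:17,047][INFO ][monitor.jvm ]
[elasticsearch-i2-1] [gc][old][404508][27] duration [5.7s], collections
[1]/[5.7s], total [5.7s]/[32s], memory [17.6gb]->[17.7gb]/[17.7gb],
all_pools {[young] [120.9mb]->[266.2mb]/[266.2mb]}{[survivor]
[0b]->[26.2mb]/[33.2mb]}{[old] [17.5gb]->[17.5gb]/[17.5gb]}
"""

rows = old_gen_usage(SAMPLE)
for kind, before, after, limit in rows:
    print(f"{kind}: old gen {before:.1f} -> {after:.1f} of {limit:.1f} GB")
```

In the two sampled entries the old generation goes from 10.4 GB at 18:09:25 to the full 17.5 GB limit at 18:10:17, and the old-generation collection at that point frees nothing.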



(system) #3