Elasticsearch 1.4.4 sporadically crashes

I have a single-node Elasticsearch 1.4.4 instance that has been sporadically crashing, and I have not been able to identify the root cause. Here are the stats for my server and my Elasticsearch instance:

Java - jdk1.8.0_25
Elasticsearch - 1.4.4
ES_HEAP_SIZE=16g
Marvel - 1.1.0

It had been running fine up until last weekend (4/2/2017), when all of a sudden it started crashing after 1-3 hours of uptime. I have a couple of suspects, but I cannot pinpoint the root cause:

  1. The server is currently under heavy load; it could be that our hardware can't handle it.
  2. We have one extremely large index, 257 million documents in total. I am not sure if this could cause problems.
  3. Something has caused our indices to become corrupted, judging from the logs I have attached. Marvel in particular appears to be having trouble creating any indices, repeatedly hitting UnavailableShardsException.
  4. We have over 10,000 shards, with 5,436 active and 5,446 replicas, for just over 1,000 indices (see the sketch after this list for how these counts can be checked).
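
For reference, a minimal sketch of how these counts can be pulled from the standard cluster health and cat APIs (assuming the node listens on localhost:9200):

    curl -s 'localhost:9200/_cluster/health?pretty'    # overall status plus active/unassigned shard counts
    curl -s 'localhost:9200/_cat/indices?v' | wc -l    # rough index count (header line included)
    curl -s 'localhost:9200/_cat/shards?v'  | wc -l    # rough shard count, primaries + replicas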

I will post an abbreviated log from the last crash below. Any info that could point me in the right direction would be greatly appreciated. I can provide more info if needed.

[2017-04-05 12:47:18,893][INFO ][node ] [ki003] version[1.4.4], pid[28350], build[c88f77f/2015-02-19T13:05:36Z]
[2017-04-05 12:47:18,893][INFO ][node ] [ki003] initializing ...
[2017-04-05 12:47:18,914][INFO ][plugins ] [ki003] loaded [marvel], sites [marvel, head, migration]
[2017-04-05 12:47:20,403][WARN ][monitor.jvm ] [ki003] ignoring gc_threshold for [young], missing warn/info/debug values
[2017-04-05 12:47:20,404][WARN ][monitor.jvm ] [ki003] ignoring gc_threshold for [old], missing warn/info/debug values
[2017-04-05 12:48:00,419][INFO ][node ] [ki003] initialized
[2017-04-05 12:48:00,419][INFO ][node ] [ki003] starting ...
[2017-04-05 12:48:00,568][INFO ][transport ] [ki003] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.18.146.131:9300]}
[2017-04-05 12:48:00,586][INFO ][discovery ] [ki003] elasticsearch_intellinx_prod/Z48wavFPSkONsN3T5mPacA
[2017-04-05 12:48:04,363][INFO ][cluster.service ] [ki003] new_master [ki003][Z48wavFPSkONsN3T5mPacA][intxpprki003][inet[/172.18.146.131:9300]], reason: zen-disco-join (elected_as_master)
[2017-04-05 12:48:04,658][INFO ][http ] [ki003] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.18.146.131:9200]}
[2017-04-05 12:48:04,658][INFO ][node ] [ki003] started
[2017-04-05 12:48:06,361][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:06,442][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,214][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,267][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,685][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,738][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,799][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,925][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,969][INFO ][gateway ] [ki003] recovered [1089] indices into cluster_state
[2017-04-05 12:48:07,970][DEBUG][action.admin.indices.create] [ki003] [efax_consumers] failed to create
org.elasticsearch.indices.InvalidIndexNameException: [efax_consumers] Invalid index name [efax_consumers], already exists as alias
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:197)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:559)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.access$200(MetaDataCreateIndexService.java:87)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:243)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:352)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2017-04-05 12:48:07,976][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,979][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
....
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [node_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [cluster_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [shard_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [shard_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
....
[2017-04-05 12:56:56,982][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,805][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,805][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,227][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,227][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,264][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,264][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

Hey,

This is very hard to tell from this kind of information. Also, I am not too sure how much sense it makes to debug this further, given that you have 10k shards (ok, 'only' 5k active shards) on a single node, which you shouldn't do at all. Plus, you are using a really old version of Elasticsearch, released more than two years ago.

You should check things like node stats, node info, garbage collection, and hot threads to pinpoint the issue further, but even once you know the cause, you will need to invest some time to get a stable system.
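
As a minimal sketch, these are the standard 1.x endpoints for those checks (again assuming the node listens on localhost:9200):

    curl -s 'localhost:9200/_nodes/stats?pretty'    # node stats: heap usage, GC counts and times, thread pools
    curl -s 'localhost:9200/_nodes?pretty'          # node info: JVM version, OS, installed plugins
    curl -s 'localhost:9200/_nodes/hot_threads'     # hot threads: what the node is busy doing right now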

--Alex
