Elasticsearch 1.4.4 sporadically crashes

I have a single-node Elasticsearch 1.4.4 instance that has been sporadically crashing, and I have not been able to identify the root cause. Here are the stats for my server and my Elasticsearch instance:

Java - jdk1.8.0_25
Elasticsearch - 1.4.4
ES_HEAP_SIZE=16g
Marvel - 1.1.0

It had been running fine up until last weekend (4/2/2017), when all of a sudden it started crashing after 1-3 hours of uptime. I have a couple of suspects, but I cannot pinpoint the root cause:

  1. The server is currently under heavy load; it could be that our hardware can't handle it.
  2. We have one extremely large index, 257 million documents in total. I am not sure if this could cause problems.
  3. Something has caused our indices to become corrupted, judging from the logs I have attached. Marvel in particular appears to be having trouble creating any indices, repeatedly hitting UnavailableShardsException.
  4. We have over 10,000 shards, with 5,436 active and 5,446 replicas, for just over 1,000 indices (see the sketch after this list for how these counts can be checked).
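
For reference, a minimal sketch of how these counts can be pulled from the standard cluster health and cat APIs (assuming the node listens on localhost:9200):

    curl -s 'localhost:9200/_cluster/health?pretty'    # overall status plus active/unassigned shard counts
    curl -s 'localhost:9200/_cat/indices?v' | wc -l    # rough index count (header line included)
    curl -s 'localhost:9200/_cat/shards?v'  | wc -l    # rough shard count, primaries + replicas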

I will post an abbreviated log from the last crash below. Any info that could point me in the right direction would be greatly appreciated. I can provide more info if needed.

[2017-04-05 12:47:18,893][INFO ][node ] [ki003] version[1.4.4], pid[28350], build[c88f77f/2015-02-19T13:05:36Z]
[2017-04-05 12:47:18,893][INFO ][node ] [ki003] initializing ...
[2017-04-05 12:47:18,914][INFO ][plugins ] [ki003] loaded [marvel], sites [marvel, head, migration]
[2017-04-05 12:47:20,403][WARN ][monitor.jvm ] [ki003] ignoring gc_threshold for [young], missing warn/info/debug values
[2017-04-05 12:47:20,404][WARN ][monitor.jvm ] [ki003] ignoring gc_threshold for [old], missing warn/info/debug values
[2017-04-05 12:48:00,419][INFO ][node ] [ki003] initialized
[2017-04-05 12:48:00,419][INFO ][node ] [ki003] starting ...
[2017-04-05 12:48:00,568][INFO ][transport ] [ki003] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.18.146.131:9300]}
[2017-04-05 12:48:00,586][INFO ][discovery ] [ki003] elasticsearch_intellinx_prod/Z48wavFPSkONsN3T5mPacA
[2017-04-05 12:48:04,363][INFO ][cluster.service ] [ki003] new_master [ki003][Z48wavFPSkONsN3T5mPacA][intxpprki003][inet[/172.18.146.131:9300]], reason: zen-disco-join (elected_as_master)
[2017-04-05 12:48:04,658][INFO ][http ] [ki003] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.18.146.131:9200]}
[2017-04-05 12:48:04,658][INFO ][node ] [ki003] started
[2017-04-05 12:48:06,361][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:06,442][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,214][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,267][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,685][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,738][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,799][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,925][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,969][INFO ][gateway ] [ki003] recovered [1089] indices into cluster_state
[2017-04-05 12:48:07,970][DEBUG][action.admin.indices.create] [ki003] [efax_consumers] failed to create
org.elasticsearch.indices.InvalidIndexNameException: [efax_consumers] Invalid index name [efax_consumers], already exists as alias
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:197)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:559)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.access$200(MetaDataCreateIndexService.java:87)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:243)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:352)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2017-04-05 12:48:07,976][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
[2017-04-05 12:48:07,979][DEBUG][action.search.type ] [ki003] All shards failed for phase: [query]
....
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [node_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [cluster_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [shard_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
[2017-04-05 12:50:11,761][ERROR][marvel.agent.exporter ] [ki003] create failure (index:[.marvel-2017.04.05] type: [shard_event]): UnavailableShardsException[[.marvel-2017.04.05][0] Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@376fe9b7]
....
[2017-04-05 12:56:56,982][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,804][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,805][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 12:59:10,805][DEBUG][action.bulk ] [ki003] observer timed out. notifying listener. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,227][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,227][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,264][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2017-04-05 13:00:12,264][DEBUG][action.bulk ] [ki003] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

Hey,

This is very hard to tell from this kind of information. Also, I am not too sure how much sense it makes to debug this further, given that you have 10k shards (ok, 'only' 5k active shards) on a single node, which you shouldn't do at all. Plus, you are using a really old version of Elasticsearch, released more than two years ago.

You should check things like node stats, node info, garbage collection, and hot threads to pinpoint the issue further, but even once you know the cause, you will need to invest some time to get a stable system.
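
As a minimal sketch, these are the standard 1.x endpoints for those checks (again assuming the node listens on localhost:9200):

    curl -s 'localhost:9200/_nodes/stats?pretty'    # node stats: heap usage, GC counts and times, thread pools
    curl -s 'localhost:9200/_nodes?pretty'          # node info: JVM version, OS, installed plugins
    curl -s 'localhost:9200/_nodes/hot_threads'     # hot threads: what the node is busy doing right now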

--Alex
