Elasticsearch hangs at 541 shards

Greetings,

this is my setup:
CentOS 7 on VMware ESXi, 8GB RAM
elasticsearch 2.3.5 (via repo)
1 node

When I start the service, Elasticsearch hangs in status red.

Here's the output:

curl -XGET http://localhost:9200/_cluster/health?pretty

{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 541,
"active_shards" : 541,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 561,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 49.09255898366606
}

elasticsearch.log reveals nothing unusual.

Elasticsearch keeps hanging with 541 active and 561 unassigned shards every time I start it. This used to work for months. I suspect it's somehow running into a limit.

I've already raised the open file descriptor limit to 65536.
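
In case the details matter: on a CentOS RPM install this is typically done via the service's sysconfig file (and, when starting ES by hand, the PAM limits), roughly like this - treat the paths as the stock package layout:

# /etc/sysconfig/elasticsearch
MAX_OPEN_FILES=65536

# /etc/security/limits.conf (only relevant when not starting via the init script)
elasticsearch - nofile 65536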

Any help would greatly be appreciated.

TIA
Matthias, Marburg/Germany

Hi @mkoch,

Status "red" means that at least one primary shard (and all of its replicas) is missing. Any chance you had at least two nodes running at some point?

I suggest you check out the Cluster Health section in the Definitive Guide.
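
It may also help to look at exactly which shards are unassigned and which indices they belong to, e.g. via the cat APIs (a rough sketch; the output columns can vary slightly between versions):

# list every shard with its state and the node (if any) it is assigned to
curl 'localhost:9200/_cat/shards?v'

# show only the unassigned ones
curl -s 'localhost:9200/_cat/shards' | grep UNASSIGNED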

Daniel

You have too many shards for a single node

Not at all. It's always been a standalone ELK server.

So there's a maximum number of shards? I haven't seen it in the documentation yet. Can you give me a hint (chapter, page, something)?

There is no hard limit, but each shard comes at a cost.

Obviously. I just cannot see that cost anywhere: not on the machine itself, not in degraded performance (8 cores, 8 GB RAM), nor in the logs. Currently I'm processing 38 GB of logfiles, which results in an Elasticsearch data directory of 69 GB.

I'm pretty much at a loss at the moment... I've rechecked the configuration of both the system and ES, made sure that the heap size fits (4 GB now) and swap is turned off, raised the memory from 8 to 16 GB, and added and removed another node, which seemed to work well... all to no avail. ES stops at 541 active shards, leaving 561 unassigned. The ES log gives no hint as to what made it hang.
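
If the raw numbers help, this is roughly how the shard and index totals can be counted (one line per shard/index in the default cat output):

# total number of shards (primaries + replicas)
curl -s 'localhost:9200/_cat/shards' | wc -l

# number of indices
curl -s 'localhost:9200/_cat/indices' | wc -l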

curl localhost:9200/_nodes/stats/process?pretty

{
"cluster_name" : "elasticsearch",
"nodes" : {
"aqObmA7NT-20RFgDUxSYDA" : {
"timestamp" : 1472569546425,
"name" : "node-1",
"transport_address" : "192.168.1.222:9300",
"host" : "192.168.1.222",
"ip" : [ "192.168.1.222:9300", "NONE" ],
"process" : {
"timestamp" : 1472569546425,
"open_file_descriptors" : 11123,
"max_file_descriptors" : 65536,
"cpu" : {
"percent" : 0,
"total_in_millis" : 126650
},
"mem" : {
"total_virtual_in_bytes" : 31962480640
}
}
}
}
}

curl localhost:9200/_nodes/stats?pretty|grep heap

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 11324 100 11324 0 0 7944 0 0:00:01 0:00:01 --:--:-- 7946
"heap_used_in_bytes" : 1002817224,
"heap_used_percent" : 23,
"heap_committed_in_bytes" : 4207738880,
"heap_max_in_bytes" : 4207738880,
"non_heap_used_in_bytes" : 78643016,
"non_heap_committed_in_bytes" : 80228352,

The data I'm working on are logfiles which grow over time, and it's easy to believe that at some point a line was crossed where ES couldn't handle the volume any more (see @warkolm's posts). What confuses me is that I never saw any warning about thresholds being reached, performance going down, resources being exhausted, any logfile entries... just nothing. It stops working and that's it. What am I missing here?

Matthias

Do all these shards belong to one index or span multiple indices? Are your indices configured to have replicas?

And do you get into this red state when starting up the node that was part of the cluster before? Did you have indices before and then restart, or did you start with a fresh cluster and try to create an index?

Could you try logging at the DEBUG level and then see if there is anything pertinent to allocation or the AllocationService? Or any shard failed exceptions in the logs?
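
If editing logging.yml and restarting is inconvenient, the log level can usually be raised on the fly via the cluster settings API; a sketch (the two logger names are just the ones most relevant to allocation):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.gateway": "DEBUG",
    "logger.cluster.routing.allocation": "DEBUG"
  }
}'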

As I have not set anything referring to multiple indices or replicas (the ES configuration has not been changed much at all; it's more or less as it came with the repo), I think the answer to that is no.

The node has never been part of a cluster, it's been a standalone ELK installation.

What confuses me is that it used to work this way for months, until it suddenly stopped for no apparent reason. I somewhat doubt that resource exhaustion is the cause.

I have raised the log level to DEBUG and I'm getting loads of messages, most of them not meaningful to me. As far as I can see, there are no obvious or typical error messages (i.e. the words "fail", "exception" or AllocationService never appear). The word "allocation" appears frequently, typically in this context:

[2016-08-31 09:20:30,603][DEBUG][gateway ] [node-1] [filebeat-2016.07.13][2] found 1 allocations of [filebeat-2016.07.13][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T07:20:23.253Z]], highest version: [46]

[2016-08-31 09:20:30,603][DEBUG][gateway ] [node-1] [filebeat-2016.07.13][2]: throttling allocation [[filebeat-2016.07.13][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T07:20:23.253Z]]] to [[{node-1}{4AHnduKWRTOsh3Y7oXpvLg}{192.168.1.222}{192.168.1.222:9300}]] on primary allocation

Closer to the end of the logfile, lines like this appear:

[2016-08-31 09:25:49,860][DEBUG][index.shard ] [node-1] [filebeat-2016.06.22][0] marking shard as inactive (inactive_time=[300000000000]) indexing wise

When I start the service, the node is in condition "red" and stays there.

watch -n3 'curl -XGET http://localhost:9200/_cluster/health?pretty'

shows the index building up, with an initial value of about 1090 unassigned shards, decreasing as the number of active_primary_shards and active_shards increases, until they reach 541. At that point the process stops. So far, I cannot see corresponding lines in the log (I probably need to dial the log level back down; DEBUG really floods you).

During startup, I am getting a lot of lines like these:

[2016-08-31 10:35:50,037][DEBUG][gateway ] [node-1] [filebeat-2016.05.25][1]: throttling allocation [[filebeat-2016.05.25][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.869Z]]] to [[{node-1}{1hQ6-5ZZQNO5tTmXUZRtCg}{192.168.1.222}{192.168.1.222:9300}]] on primary allocation

The last line in the log from the earlier startup (before the restart) is

[2016-08-31 09:26:20,272][DEBUG][index.engine ] [node-1] [filebeat-2016.05.18][4] successfully sync committed. sync id [AVbffdFB3UQToPOebqrr].

After the restart, this appears to be the final message in the log:

[2016-08-31 10:36:01,893][DEBUG][cluster.service ] [node-1] processing [shard-started ([filebeat-2016.05.12][1], node[1hQ6-5ZZQNO5tTmXUZRtCg], [P], v[47], s[INITIALIZING], a[id=CIcEOdvaS2GufO6B07GtVA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.879Z]]), reason [after recovery from store],shard-started ([.kibana][0], node[1hQ6-5ZZQNO5tTmXUZRtCg], [P], v[47], s[INITIALIZING], a[id=kiIcfznfRNih2zsBNo12Xg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.880Z]]), reason [after recovery from store],shard-started ([filebeat-2016.05.12][1], node[1hQ6-5ZZQNO5tTmXUZRtCg], [P], v[47], s[INITIALIZING], a[id=CIcEOdvaS2GufO6B07GtVA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.879Z]]), reason [master {node-1}{1hQ6-5ZZQNO5tTmXUZRtCg}{192.168.1.222}{192.168.1.222:9300} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started],shard-started ([.kibana][0], node[1hQ6-5ZZQNO5tTmXUZRtCg], [P], v[47], s[INITIALIZING], a[id=kiIcfznfRNih2zsBNo12Xg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.880Z]]), reason [master {node-1}{1hQ6-5ZZQNO5tTmXUZRtCg}{192.168.1.222}{192.168.1.222:9300} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]]: took 65ms done applying updated cluster_state (version: 272, uuid: DZmfu0JRRpSIsJKUsZUzdw)

then being followed by

[2016-08-31 10:40:01,481][DEBUG][indices.memory ] [node-1] recalculating shard indexing buffer, total is [401.2mb] with [541] active shards, each shard set to indexing=[4mb], translog=[64kb]

every 15 or so seconds.

How about checking whether your "path.data" has been changed, or whether these unassigned shards are still stored in your present "path.data"?

path.data has never been changed.

You must have made some configuration change to alter the number of shards, or you must have created multiple indices. Otherwise, it's not possible to have as many shards as you do.

You have so many shards that it's taking a long time to recover - we throttle recoveries so as not to overwhelm a node, especially during cluster recovery.
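
If waiting it out is not acceptable, the throttling can be loosened somewhat with dynamic cluster settings; a sketch with illustrative values (the defaults are deliberately conservative):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}'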

This is just a heuristic we use to see if we can reduce the indexing/translog buffer sizes if there aren't any active indexing operations taking place.

After this, do you see anything in the logs along the lines of sending shard started for [....]? You should see this message right after the message above, and if it failed you should then also see failed to send shard started to [...].
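
A quick way to check is to grep the log for those phrases (the path assumes the stock RPM layout and the default cluster name):

grep -E 'sending shard started|failed to send shard started' /var/log/elasticsearch/elasticsearch.log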

First, let me thank you for your detailed reply. I must admit I am totally new to ES; I set up an ELK server as per the instructions and it just worked for me - until it stopped.

I couldn't tell how many shards were "normal" or could be handled by a single node. I am currently handling 40 GB of logfiles, which resulted in a node data directory of 60 GB. Neither value seems totally out of bounds to me, and the machine itself should have sufficient resources to handle data of that size (now 12 CPUs and 16 GB RAM). I never noticed extraordinary load on it - I've been monitoring and logging it from the start. Apparently there must be a bottleneck somewhere else.

The string sending shard started for never appeared in any logfile until yesterday (it probably only shows up when the log level is raised to DEBUG). As of yesterday, lines like

[2016-08-31 10:35:58,035][DEBUG][cluster.action.shard ] [node-1] [filebeat-2016.05.17][3] sending shard started for target shard [[filebeat-2016.05.17][3], node[1hQ6-5ZZQNO5tTmXUZRtCg], [P], v[45], s[INITIALIZING], a[id=o9mdmitqR7GIxqH_oDtZlA], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-08-31T08:35:34.879Z]]], indexUUID [H2twazNzTNyBvo3yT86t8Q], message [after recovery from store]

occur.

I have moved the old directory away and restarted the ES process. After about 18 hours, the process has built up a new directory of currently 36 GB; it's in state "yellow" with 16 active shards. I'll wait until it is finished.

The data in the old directory is expendable as we're not in production mode. I'd still like to find out how the ELK server borked up.

This is the current situation:

curl -XGET http://localhost:9200/_cluster/health?pretty

{
"cluster_name" : "elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 16,
"active_shards" : 16,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 16,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 50.0
}

logfile:
[2016-09-01 12:05:39,752][DEBUG][indices.memory ] [node-1] recalculating shard indexing buffer, total is [401.2mb] with [5] active shards, each shard set to indexing=[80.2mb], translog=[64kb]

(repeatedly, every 30 seconds or so; that's probably data continuously being fed into ES)

At the moment, I can access es, but it's still in condition "yellow".

I would like to add this: under the hood, each shard is a Lucene index that resides in its own directory, and each shard (being a single Lucene index) can hold up to about 2 billion documents. When you have this many shards, you'll see this many directories created under the hood. I've ingested 2 billion documents into a single shard before, just to see how it handled indexing and searching. Each shard seems to respond very well with up to 1 billion documents (good for indexing, searching, loading, etc.). When ES is up and the index is active, in theory it will "talk" to all the shards, and thus it will consume a significant amount of system resources. That's why you see yellow: it's still in the process of "talking" to all the shards.
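
You can check how many documents each index and shard currently holds with the cat APIs, e.g. (a sketch; the h= column selection should work on 2.x):

# documents and on-disk size per index
curl 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

# per-shard document counts
curl 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,docs,store'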

Based on what you said above, you have 12 CPUs and 16 GB RAM. The system RAM is on the low side, because the heap size can be set to at most about 30 GB; keep this resource in mind if you get the opportunity to upgrade this machine.
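
For the record, on 2.x the heap is normally set via ES_HEAP_SIZE; on a CentOS RPM install that would typically live in the sysconfig file, something along these lines (keep it at no more than about half the RAM and well below ~30 GB):

# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=8g   # roughly half of 16 GB, leaving the rest to the filesystem cache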

I suggest doing a little bit of planning in terms of:

  • how many documents (not data size in terms of GB) are you going to index per day?
  • how many documents are you planning to keep in one index?
  • over time, how many documents or indices do you want to keep around?

Depending on your answers, you can decide how many shards you should have per index and how many indices you should create and maintain on this machine. For testing, I understand you can create as many indices with as many shards as you want, but eventually the system will not be able to handle the load effectively (as you've seen). ES is not the problem here; it's the limited resources given to ES.
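
For example, since these are small daily filebeat indices on a single node, one common approach is an index template that lowers the per-index shard count and drops replicas; a sketch (the template name is arbitrary, and it only applies to indices created after the template is in place):

curl -XPUT 'localhost:9200/_template/filebeat_single_node' -d '{
  "template": "filebeat-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'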

As a side note, you can also see information about ongoing recoveries using this API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-recovery.html
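
For example:

# one line per shard recovery, with stage and progress
curl 'localhost:9200/_cat/recovery?v'

# or the full JSON version
curl 'localhost:9200/_recovery?pretty'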

Also, I'm not sure what your index settings are; you can get them by doing a GET on your index(es): https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-index.html
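
For example, assuming your indices follow the filebeat-* naming seen in the logs:

curl 'localhost:9200/filebeat-*/_settings?pretty'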

If you have more replicas configured than there are nodes in your cluster, you will have a yellow state, as not all replicas can be assigned. From the looks of the latest cluster health response you posted, you only have 1 node in your cluster. If you have 1 replica configured (which is the default), then your cluster will be yellow because your one node holds the primary shard(s) and there is no other node on which to put the replicas (it doesn't make sense to put a replica shard on the same node as the primary - if you lose the node, you lose both, so it's redundant).
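
If you want the single-node cluster to go green, you could drop the replicas on the existing indices (and bake the same setting into an index template for future ones); a sketch:

# set number_of_replicas to 0 on all existing indices
curl -XPUT 'localhost:9200/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'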