Hi,
I have an Elasticsearch 2.4.4 cluster with a single server/node (esnode1) holding only one 220GB index with 1 shard and zero replicas.
The index was responding fine, but after I cleanly restarted the server (EC2, 2 CPUs, 4GB RAM, 500GB SSD) the cluster state has been stuck on red with "initializing_shards" = 1 for quite a while now, with no CPU or disk usage (the system is idle and not swapping).
I've already raised indices.recovery.max_bytes_per_sec to 50mb and followed the instructions in Rolling Restarts | Elasticsearch: The Definitive Guide [2.x] | Elastic, without success.
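For reference, I bumped the recovery throttle via the cluster settings API, roughly like this (persistent, which matches the "updating [indices.recovery.max_bytes_per_sec]" line in the startup log below):

# raise the recovery throttle from the default 40mb (dynamic setting in ES 2.x)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.recovery.max_bytes_per_sec": "50mb" }
}'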
Right now I'm at a loss as to how to debug or understand this, as the logs (below) look pretty much normal. Any hints?
Edit: the behaviour described above occurs when I give ES a 2GB heap. With a 3GB heap, the cluster status turns green seconds after restart. Any hints on what could be causing this?
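In case it matters, I set the heap via ES_HEAP_SIZE; where exactly depends on the install, but mine is the standard package layout:

# /etc/default/elasticsearch (DEB) or /etc/sysconfig/elasticsearch (RPM)
# this is the value I toggled between the two runs described above
ES_HEAP_SIZE=3g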
GET _cluster/health returns:
{
"cluster_name" : "escluster1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 1,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 0.0
}
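(The node only binds to localhost, per the log below, so the exact call is the one that follows; I can also post _cat/recovery output, which should show per-shard recovery progress, if that helps:)

# cluster health, as pasted above
curl 'http://localhost:9200/_cluster/health?pretty'
# per-shard recovery progress (happy to post this too)
curl 'http://localhost:9200/_cat/recovery?v'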
This is the log following a restart:
[2017-05-04 15:00:37,975][INFO ][node ] [esnode1] version[2.4.4], pid[2761], build[fcbb46d/2017-01-03T11:33:16Z]
[2017-05-04 15:00:37,976][INFO ][node ] [esnode1] initializing ...
[2017-05-04 15:00:38,534][INFO ][plugins ] [esnode1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-05-04 15:00:38,563][INFO ][env ] [esnode1] using [1] data paths, mounts [[/mnt/esdata2 (/dev/xvdh1)]], net usable_space [226.3gb], net total_space [492gb], spins? [no], types [ext4]
[2017-05-04 15:00:38,563][INFO ][env ] [esnode1] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-05-04 15:00:40,379][INFO ][node ] [esnode1] initialized
[2017-05-04 15:00:40,380][INFO ][node ] [esnode1] starting ...
[2017-05-04 15:00:40,501][INFO ][transport ] [esnode1] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2017-05-04 15:00:40,506][INFO ][discovery ] [esnode1] escluster1/sv3aHhUjSyueq5N4_w14mQ
[2017-05-04 15:00:43,565][INFO ][cluster.service ] [esnode1] new_master {esnode1}{sv3aHhUjSyueq5N4_w14mQ}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-05-04 15:00:43,595][INFO ][indices.recovery ] [esnode1] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [50mb]
[2017-05-04 15:00:43,631][INFO ][http ] [esnode1] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2017-05-04 15:00:43,632][INFO ][node ] [esnode1] started
[2017-05-04 15:00:43,651][INFO ][gateway ] [esnode1] recovered [1] indices into cluster_state
Edit: after switching the log level to DEBUG (still with the 2GB heap), the cluster state remains red and I can see the following messages being logged repeatedly, every 30 seconds (how I raised the level is shown after the excerpt):
[2017-05-10 15:58:45,985][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 15:59:15,985][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]
[2017-05-10 15:59:15,990][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 15:59:45,990][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]
[2017-05-10 15:59:45,997][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 16:00:15,997][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]
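For completeness, this is roughly how I raised those logger levels at runtime (ES 2.x accepts logger.* keys through the cluster settings API; editing config/logging.yml and restarting should work too):

# turn on DEBUG for the two loggers seen in the excerpt above
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.index.shard": "DEBUG",
    "logger.indices.memory": "DEBUG"
  }
}'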