One node cluster stuck on initializing_shards


#1

Hi,

I have a 2.4.4 cluster with a single server/node (esnode1) containing only one 220GB index with 1 shard and zero replicas.

The index was responding fine, but after I cleanly restarted the server (EC2, 2 CPUs, 4GB RAM, 500GB SSD), the cluster state has been stuck on red with "initializing_shards" = 1 for quite a while now, even though there is no CPU or disk usage (the system is idle and not swapping).

I've already raised indices.recovery.max_bytes_per_sec to 50mb, and tried the instructions at https://www.elastic.co/guide/en/elasticsearch/guide/current/_rolling_restarts.html, without success.
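For reference, this is how I raised the recovery throttle, via a transient cluster settings update (requires a running cluster on localhost:9200; the 50mb value matches what the logs below confirm):

```shell
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}'
```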

Right now I'm at a loss as to how to debug or understand this, since the logs (below) look pretty much normal. Any hints?

Edit: the behaviour described occurs if I set a 2GB heap for ES. However, with a 3GB heap the cluster status changes to green seconds after restart. Any hints on what could be causing this?

The output of /_cluster/health is:
{
  "cluster_name" : "escluster1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

This is the log following a restart:

[2017-05-04 15:00:37,975][INFO ][node                     ] [esnode1] version[2.4.4], pid[2761], build[fcbb46d/2017-01-03T11:33:16Z]
[2017-05-04 15:00:37,976][INFO ][node                     ] [esnode1] initializing ...
[2017-05-04 15:00:38,534][INFO ][plugins                  ] [esnode1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-05-04 15:00:38,563][INFO ][env                      ] [esnode1] using [1] data paths, mounts [[/mnt/esdata2 (/dev/xvdh1)]], net usable_space [226.3gb], net total_space [492gb], spins? [no], types [ext4]
[2017-05-04 15:00:38,563][INFO ][env                      ] [esnode1] heap size [1.9gb], compressed ordinary object pointers [true]
[2017-05-04 15:00:40,379][INFO ][node                     ] [esnode1] initialized
[2017-05-04 15:00:40,380][INFO ][node                     ] [esnode1] starting ...
[2017-05-04 15:00:40,501][INFO ][transport                ] [esnode1] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2017-05-04 15:00:40,506][INFO ][discovery                ] [esnode1] escluster1/sv3aHhUjSyueq5N4_w14mQ
[2017-05-04 15:00:43,565][INFO ][cluster.service          ] [esnode1] new_master {esnode1}{sv3aHhUjSyueq5N4_w14mQ}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-05-04 15:00:43,595][INFO ][indices.recovery         ] [esnode1] updating [indices.recovery.max_bytes_per_sec] from [40mb] to [50mb]
[2017-05-04 15:00:43,631][INFO ][http                     ] [esnode1] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2017-05-04 15:00:43,632][INFO ][node                     ] [esnode1] started
[2017-05-04 15:00:43,651][INFO ][gateway                  ] [esnode1] recovered [1] indices into cluster_state

Edit: after switching the log level to DEBUG (still with a 2GB heap), the cluster state remains "red" and I can see the following messages being logged repeatedly every 30 seconds:

[2017-05-10 15:58:45,985][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 15:59:15,985][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]
[2017-05-10 15:59:15,990][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 15:59:45,990][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]
[2017-05-10 15:59:45,997][DEBUG][index.shard ] [esnode1] [myIndex][0] updateBufferSize: engine is closed; skipping
[2017-05-10 16:00:15,997][DEBUG][indices.memory ] [esnode1] recalculating shard indexing buffer, total is [203.1mb] with [1] active shards, each shard set to indexing=[203.1mb], translog=[64kb]


(Simon Willnauer) #2

what's the output of curl -XGET 'localhost:9200/_cat/recovery?v&pretty' --> https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html


#3

Here's the formatted output of _cat/recovery

  • index myIndex
  • shard 0
  • time 112517
  • type store
  • stage translog
  • source_host 127.0.0.1
  • target_host 127.0.0.1
  • repository n/a
  • snapshot n/a
  • files 0
  • files_percent 100.0%
  • bytes 0
  • bytes_percent 100.0%
  • total_files 274
  • total_bytes 2.34847E+11
  • translog 0
  • translog_percent -1.0%
  • total_translog -1
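For scale, the total_bytes figure is consistent with the ~220GB index mentioned at the top once you account for decimal vs binary units (a quick check):

```python
# total_bytes reported by _cat/recovery (2.34847E+11)
total_bytes = 234_847_000_000

print(total_bytes / 1000**3)  # ~234.85 GB (decimal gigabytes)
print(total_bytes / 1024**3)  # ~218.72 GiB (binary), i.e. the ~220GB index
```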

#4

I fixed this by setting a fixed name for the node and raising the heap size to 3GB to accommodate the static segment data reported at "memory_in_bytes" : 2272548617 (roughly 2.1GB, which wouldn't fit in a 2GB heap). The cluster status now turns green seconds after starting or restarting.
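As a quick sanity check, the reported segment memory can be compared against the two heap sizes tried above (the 2272548617 figure is the "memory_in_bytes" value from the index stats):

```python
# Segment memory reported by the index stats ("memory_in_bytes").
segment_memory_bytes = 2272548617

two_gb = 2 * 1024**3    # the heap that got stuck on red
three_gb = 3 * 1024**3  # the heap that recovered in seconds

print(segment_memory_bytes / 1024**3)       # ~2.12 GiB of segment data
print(segment_memory_bytes > two_gb)        # True: exceeds the 2GB heap outright
print(segment_memory_bytes < three_gb)      # True: fits (barely) in the 3GB heap
```

Note that even when the raw segment data fits, the heap also has to hold indexing buffers, caches, and everything else, so "barely fits" would still be a tight configuration.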


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.