Indices stuck after recovery from backup

Hi,

I apologize for the large number of edits; I had some trouble with the code blocks, since I didn't realize that an extra carriage return needs to precede the preformatted text.

I have a small Elasticsearch cluster that had to be restored from backup, and I have since run into some index issues. I have a private index whose replica shard is stuck in INITIALIZING / ALLOCATION_FAILED status. We can call this Issue 1.

Issue 2 is that there are several indices named like ".monitoring-es-7-2019.07.XX" which are UNASSIGNED with reason CLUSTER_RECOVERED. Based on my research, these may be rotated out after 15 days or so, but I wouldn't mind clearing them up myself today if possible. I don't think I really need them, so they can be deleted if I can find a way; they are not listed under managed indices in Kibana, I expect because they are hidden/system indices.
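For reference, they do still show up through the API even though Kibana doesn't list them; something like this should list them (the index pattern here is just my guess at matching them):

GET /_cat/indices/.monitoring-es-*?v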

GET /_cluster/allocation/explain?pretty gives some useful data, but only for Issue 2:
"allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt"

Issue 1 looks more like this:

{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<transport_request>] would be [1027222334/979.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [1027222120/979.6mb], new bytes reserved: [214/214b]",
        "bytes_wanted": 1027222334,
        "bytes_limit": 986061209,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<transport_request>] would be [1027222334/979.6mb], which is larger than the limit of [986061209/940.3mb], real usage: [1027222120/979.6mb], new bytes reserved: [214/214b]",
    "bytes_wanted": 1027222334,
    "bytes_limit": 986061209,
    "durability": "PERMANENT"
  },
  "status": 429
}

I have done some research on this "Data too large" / [<transport_request>] error, but all the topics I found point to problems that don't seem to match my use case. I've also been trying to understand these limits: I had 8GB allocated to these VMs and have expanded them to 16GB, yet the limit remains at 940.3mb. Should I change this to match my environment more realistically, or is there a different issue impacting me here? It's also interesting to me that I see these same errors in logstash.log, which I thought would have nothing to do with the indexing portion.
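In case it helps, here is how I've been checking the breaker limits themselves; if I'm reading it right, the parent breaker's limit_size is where the 940.3mb figure comes from:

GET /_nodes/stats/breaker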

Without ?pretty (networking data redacted):

{
  "index" : "private_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "ihTj7kz6QQewpWM-FMo6CA",
    "name" : "node1",
    "transport_address" : "node1_ip:9300",
    "attributes" : {
      "ml.machine_memory" : "16656506880",
      "ml.max_open_jobs" : "20",
      "xpack.installed" : "true"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider" : "cluster_rebalance",
      "decision" : "NO",
      "explanation" : "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed",
  "node_allocation_decisions" : [
    {
      "node_id" : "7LIgwS6NQEuF0H8S-lLZCA",
      "node_name" : "node2",
      "transport_address" : "node2_ip:9300",
      "node_attributes" : {
        "ml.machine_memory" : "16656506880",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[private_index][0], node[7LIgwS6NQEuF0H8S-lLZCA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=iOPv9-x_R26AMQQxMbyF_A], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-07-10T16:39:15.841Z], failed_attempts[1], delayed=false, details[failed shard on node [7LIgwS6NQEuF0H8S-lLZCA]: master {node1}{ihTj7kz6QQewpWM-FMo6CA}{O6ARr-ijQSuaCDIYQ6phFQ}{node1_ip}{node1_ip:9300}{ml.machine_memory=16656506880, ml.max_open_jobs=20, xpack.installed=true} has not removed previously failed shard. resending shard failure], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

No worries, thanks for putting the effort in to get it right. It makes it so much easier to help.

Re. issue 2:

Have you adjusted the amount of memory allocated to the heap of the JVM running Elasticsearch to match? Although you've allocated more memory to your VM, it looks like you're still running Elasticsearch with the default 1GB heap size.
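If this is a package install, the heap is typically set in /etc/elasticsearch/jvm.options (path assumed); something along these lines, adjusted to suit your VMs, with a restart of each node afterwards:

# /etc/elasticsearch/jvm.options
# set the initial and maximum heap to the same value
-Xms4g
-Xmx4g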

Re. issue 1, the allocation explain output is telling us that this (primary) shard is healthy ("current_state" : "started") so I'm not sure what the problem there is. What was the command you used to get that output? Did you perhaps say "primary": true when you meant "primary": false?

To be clear, the primary shard is on node1 and working just fine. I'm trying to get the replica shard assigned, which is what's giving me all of these errors. Is "primary": false the incorrect syntax?

I will look into growing the heap size now. I was under the impression that this was set to some percentage by default though.

edit> GET /_cluster/stats shows:

Node1:
      "mem" : {
        "heap_used_in_bytes" : 456437880,
        "heap_max_in_bytes" : 1037959168
		
Node2:
      "mem" : {
        "heap_used_in_bytes" : 1574579536,
        "heap_max_in_bytes" : 2075918336

So I believe this is not yet a heap size issue. Am I reading this correctly? I'm also wondering why my max heap size is different per node. Is this the right command to check this setting? I see an older suggestion to use /_cluster/nodes/stats?jvm=true&pretty=true, but this fails due to unrecognized parameters: [jvm], [nodeId]. The doc provided explains how to set the heap size but not how to check it, which I would like to do first to validate the issue.

No, as per the docs I shared above, it defaults to 1GB and it's up to you to change this if you need it.

Do you expect your nodes to have a 1GB and a 2GB maximum heap size respectively? It is indeed odd that they're different, and also that's not a lot of heap for a production cluster. The circuit_breaking_exception message you quoted indicates that the heap on the node with the 1GB heap is 95% full, which is why it's refusing to do any meaningful work.
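One quick way to check the per-node heap limits (rather than the cluster-wide totals from _cluster/stats) is the cat nodes API, e.g. something like:

GET /_cat/nodes?v&h=name,heap.current,heap.percent,heap.max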

I'm not sure I follow. You shared the allocation explain output showing us that the primary is healthy. This is nice, of course, but I think you meant to share the output about the replica. For this, you should say "primary": false in your request to the allocation explain API.


GET /_nodes/stats returns some better data:

(Node1)
  "jvm" : {
    "timestamp" : 1562859040471,
    "uptime_in_millis" : 88730628,
    "mem" : {
      "heap_used_in_bytes" : 764154968,
      "heap_used_percent" : 73,
      "heap_committed_in_bytes" : 1037959168,
      "heap_max_in_bytes" : 1037959168,
      "non_heap_used_in_bytes" : 144271624,
      "non_heap_committed_in_bytes" : 188936192,

(Node2)
  "jvm" : {
    "timestamp" : 1562859040623,
    "uptime_in_millis" : 88154937,
    "mem" : {
      "heap_used_in_bytes" : 984559200,
      "heap_used_percent" : 94,
      "heap_committed_in_bytes" : 1037959168,
      "heap_max_in_bytes" : 1037959168,
      "non_heap_used_in_bytes" : 144338712,
      "non_heap_committed_in_bytes" : 182394880,

Indeed that percentage is too high, and I acknowledge what you say about the 1GB default. It's possible that during my research I found that percentage-based behaviour described for an older version. Either way, this is definitely an issue that needs to be addressed - I will report back once I grow the heap size.

edit > "heap_max_in_bytes" : 4225236992, (on both nodes) looks like the size has grown successfully. Replica shard is still showing the same INITIALIZING error, so I'll repeat those error collection steps.

GET /_cluster/allocation/explain
{
  "index": "private_index",
  "shard": 0,
  "primary": true
}

returns:

{
  "index" : "private_index",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "ihTj7kz6QQewpWM-FMo6CA",
    "name" : "node1",
    "transport_address" : "node1_ip:9300",
    "attributes" : {
      "ml.machine_memory" : "16656506880",
      "ml.max_open_jobs" : "20",
      "xpack.installed" : "true"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "cluster_rebalance",
      "decision" : "NO",
      "explanation" : "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed",
  "node_allocation_decisions" : [
    {
      "node_id" : "7LIgwS6NQEuF0H8S-lLZCA",
      "node_name" : "node2",
      "transport_address" : "node2_ip:9300",
      "node_attributes" : {
        "ml.machine_memory" : "16656506880",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[private_index][0], node[7LIgwS6NQEuF0H8S-lLZCA], [R], s[STARTED], a[id=XeahuLK8QkqD0iKbKy-1sw]]"
        }
      ]
    }
  ]
}

Doing my due diligence, I see one difference:
The "rebalance_only_when_active" decider appears in only one of the two outputs, while the "cluster_rebalance" decider is identical in both:

{
  "decider" : "rebalance_only_when_active",
  "decision" : "NO",
  "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
},

Otherwise they look identical to me. I'm interpreting this as "the cluster can't rebalance private_index because not all replica shards are active." Is that correct? If so, it looks like I may not be gathering the right data here, because I already know that the replica shard I am trying to assign is not being properly assigned to the second node. Further down in the first output, it describes that the replica shard (I assume [R] refers to replica) is INITIALIZING with the reason master ... has not removed previously failed shard.

I'm going to do some googling on this master ... has not removed previously failed shard.
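One thing I may try first, if I'm reading the cluster reroute docs correctly, is asking the cluster to retry the previously failed allocations:

POST /_cluster/reroute?retry_failed=true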


You're asking for the status of the primary: "primary": true. We want to see the information about the replica, for which you need to say "primary": false.
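i.e. something along the lines of:

GET /_cluster/allocation/explain
{
  "index": "private_index",
  "shard": 0,
  "primary": false
}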

Also, best to add more posts rather than to add stuff to an existing one with edits. It's just by chance that I still had this tab open, and if I hadn't then I'd never have noticed that there was more information to read.

Oh yes, I understand. I felt I was still doing my research and did not want to pester/notify just yet.

Ah, I see now. I ran this earlier and thought it looked the same, so I skimmed right past it. I also thought "current_state" : "started" couldn't possibly be true when /_cat/shards showed otherwise.

primary true:
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[private_index][0], node[7LIgwS6NQEuF0H8S-lLZCA], [R], s[STARTED], a[id=XeahuLK8QkqD0iKbKy-1sw]]"

primary false:
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[private_index][0], node[ihTj7kz6QQewpWM-FMo6CA], [P], s[STARTED], a[id=Ve28ZP0xQTqpAunmTBl7pA]]"

Reviewing this, the reasons appear mirrored. And as it turns out, I re-ran GET /_cat/shards and found that the replica is indeed now started!!! All of my custom indices are healthy now! Thank you David!

Looks like a little cleanup remains though, as the .monitoring-es-7-2019.07.0X indices are not allocated and cluster health is red, possibly because of this. I'll continue documenting my progress in the hope that this thread can help someone else in my situation one day, but I'm hopeful I'm on the right track as things stand.

Regarding Issue 2, it appears that the indices from these two days have failed to allocate for the reason "stale or corrupt". Once again I will reference the cluster reroute doc, specifically "Forced allocation on unrecoverable errors", which describes allocate_stale_primary.

[root@node1: /etc/elasticsearch]# curl -XGET node1_ip:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep .monitoring-es
.monitoring-es-7-2019.07.09     0 p UNASSIGNED CLUSTER_RECOVERED
.monitoring-es-7-2019.07.09     0 r UNASSIGNED REPLICA_ADDED
.monitoring-es-7-2019.07.08     0 p UNASSIGNED CLUSTER_RECOVERED
.monitoring-es-7-2019.07.08     0 r UNASSIGNED REPLICA_ADDED
.monitoring-es-7-2019.07.05     0 r STARTED
.monitoring-es-7-2019.07.05     0 p STARTED
.monitoring-es-7-2019.07.11     0 p STARTED
.monitoring-es-7-2019.07.11     0 r STARTED
.monitoring-es-7-2019.07.06     0 r STARTED
.monitoring-es-7-2019.07.06     0 p STARTED
.monitoring-es-7-2019.07.10     0 p STARTED
.monitoring-es-7-2019.07.10     0 r STARTED
.monitoring-es-7-2019.07.07     0 r STARTED
.monitoring-es-7-2019.07.07     0 p STARTED

Attempted:

POST /_cluster/reroute
{
    "commands" : [
        {
          "allocate_stale_primary" : {
                "index" : ".monitoring-es-7-2019.07.08", "shard" : 0,
                "node" : "node1",
                "accept_data_loss" : true
          }
        }
    ]
}

but this results in: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock]]; nested: FileNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/var/lib/elasticsearch/nodes/0/indices/dPylp9ewSdiKnwsPzD7Z4g/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2846a2ac)): files: [write.lock]]

I believe that in my case these have gotten corrupt, so I am now looking for a command to destroy them. They do not exist in the GUI, I'm assuming because the leading dot in the index name makes them hidden.

Simple enough; I was successful with that: DELETE /.monitoring-es-7-2019.07.09
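For completeness, the equivalent from the command line should be something like this (index names taken from the _cat/shards output above):

curl -XDELETE 'node1_ip:9200/.monitoring-es-7-2019.07.08'
curl -XDELETE 'node1_ip:9200/.monitoring-es-7-2019.07.09'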

And GET /_cluster/health now returns "green"!!!!! I can no longer spot any errors in my environment, so I believe the above is all that's needed for a clean recovery! Now to go lecture my fellow co-worker for uninstalling Logstash everywhere with Chef :wink:

Thanks once again to @DavidTurner; you've been invaluable in this process, in which I was most definitely stuck!


Thank you David! I'm all wrapped up and ready to be put under my boss's Christmas tree!

btw, is there any way to flag this as resolved or no worries there?

No probs, glad we could help. I've marked your last post as a solution. I think there's a way to do that yourself, but I only get the moderator UI so I've no idea where the button would be for you.

Ah, I now see the option to mark a solution. I was looking at the original post though, to mark/tag the thread as solved/resolved, but that doesn't appear to be a feature of this forum.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.