[1.7.2] unassigned shards after restart for indexes with no replication


#1

Hi,
I have a 1.7.2 ES cluster with 6 nodes, all running on the same host. There are several indexes on the cluster. I notice that after restarts the indexes with no replication often get into red state because of unassigned primary shards. I can increase the replication for this cluster due to its small size, but in production I can't afford to do that. How do I go about debugging this?

Here are the two indices that are unassigned after several restarts:
.kibana 2 p UNASSIGNED .kibana 0 p STARTED 0 144b 127.0.1.1 data-1 .kibana 3 p STARTED 2 29.6kb 127.0.1.1 data-3 .kibana 1 p STARTED 0 144b 127.0.1.1 data-6 .kibana 4 p STARTED 0 144b 127.0.1.1 data-2 domainrep 2 p UNASSIGNED domainrep 0 p STARTED 10066 2mb 127.0.1.1 data-1 domainrep 3 p STARTED 10049 2mb 127.0.1.1 data-3 domainrep 1 p STARTED 10081 2mb 127.0.1.1 data-6 domainrep 4 p STARTED 10048 2mb 127.0.1.1 data-2
The logs don't have any errors or warnings related to these indexes.

When I allocate the shard forcibly:
curl -XPOST 'localhost:9400/_cluster/reroute?pretty' -d '{
"commands": [ {
"allocate": {
"index": "domainrep", "shard": 2, "node": "data-4", "allow_primary": true
}
}]
}'

The shard gets allocated, but as an empty shard.
domainrep 2 p STARTED 0 115b 127.0.1.1 data-4
domainrep 0 p STARTED 10066 2mb 127.0.1.1 data-1
domainrep 3 p STARTED 10049 2mb 127.0.1.1 data-3
domainrep 1 p STARTED 10081 2mb 127.0.1.1 data-6
domainrep 4 p STARTED 10048 2mb 127.0.1.1 data-2

Once I allocate this, I cannot un-allocate as the shard is a primary shard.

What are the ways to recover from this state, without actually reindexing the documents? Any pointers will be helpful.

Thanks!


#2

Bumping the question up, with the hope of finding some answers. Another restart and another set of unassigned primaries. This time I see unassigned replicas for shards that have replication as well. Note that I always shutdown the nodes with the _shutdown command to minimize possible corruption etc.

Here is the index that has replication with unassigned primary replica.

$ curl -s -XGET http://localhost:9400/_cat/shards?v | grep ucresults
ucresults 2 p STARTED 693 10.9mb 127.0.1.1 data-1
ucresults 2 r STARTED 693 10.9mb 127.0.1.1 data-5
ucresults 0 p STARTED 661 10.4mb 127.0.1.1 data-2
ucresults 0 r STARTED 661 10.4mb 127.0.1.1 data-6
ucresults 3 p STARTED 705 10.9mb 127.0.1.1 data-2
ucresults 3 r STARTED 705 10.9mb 127.0.1.1 data-3
ucresults 1 p UNASSIGNED
ucresults 1 r UNASSIGNED
ucresults 4 p STARTED 656 10.4mb 127.0.1.1 data-4
ucresults 4 r STARTED 656 10.4mb 127.0.1.1 data-1

When I restart the nodes, somehow the system is not able get to the primary and replica for shard-1.

I am attaching the trace logs for all lines related to the index https://gist.github.com/d-dash/db2170219217967d58ec. Any help here will be highly appreciated.

Thanks,
Dash.


(Chris Earle) #3

I notice that all of the published IP addresses are 127.0.0.1, suggesting that all nodes are running on the same machine. Is it possible that you have not started all nodes, or perhaps they got started in the wrong order at some point?

If nodes share a data path (path.data, which defaults to ./data), then the order that they are started is critical to the directory that they choose. If, for instance, a master or client node starts up at the wrong point, then those shards will never recover unless there happens to be a replica or if you restart the nodes in a different order.

On that same token, you may be able to just look in the data directory (given that I see 6 nodes, there's going to be a bunch) for the missing shards to confirm this theory.


#4

Thanks Chris.

Yes, all nodes are in the same host. I configured them this was to stay below the recommended 32g memory limit for each one. Yes, they share the same data directory right now. Although the nodes start from the script and always in the same order, I can imagine race conditions in the processes. This could explain the issues I have been having with missing shards.

For the indexes I have, can I recover the shards someway and then split the data directory? Or reloading is the only option?

Thanks again for the reply,
Dash.


(Chris Earle) #5

For the indexes I have, can I recover the shards someway and then split the data directory? Or reloading is the only option?

It depends if the data is still there. Usually you can.

Shut down the nodes.

Go to the data directory (path.data, which again is defaulted to ./data), then check each node (numbered 0 through # where # is the number of nodes that you start, minus one) and look for the indices with unassigned shards. Finally, check the shards directory for the missing shards.

cd ${path.data}/${cluster.name}/nodes
ls -la ${node.number}/indices/${index.name}/shards/${shard.number}

If you find the shard, then you can copy it and it will be recoverable as long as the node is stopped. Once found, just put them "in" nodes at the same place in another node.

Finally, you should probably prevent this situation from happening again by setting path.data for each node, then also setting node.max_local_storage_nodes to 1 per node, thereby preventing it to share the same data directory and avoid this problem in the future.

path.data: /path/to/data/node3
node.max_local_storage_nodes: 1

(Jason Tedor) #6

A few quick notes, in addition to the astute comments from @pickypg.

  1. The allow_primary flag in the allocate command in the cluster reroute API forces the allocation of a new empty primary shard; at that point, you are very likely entering a scenario with data loss. While this is documented, there is discussion underway to rename this flag so that it is even more clear the consequences of this flag.

  2. If you're going to be running six Elasticsearch processes on the same machine, I recommend setting the processors setting to the number of cores divided by the number of processes running on the machine (e.g., if you're running six Elasticsearch processes on a machine with 24 physical cores with hyperthreading, set processors to 8). Otherwise, all six nodes will think that they have a fair share at the cores on the machine, size their thread pools accordingly, and then you will see a lot of contention for the cores. You need to set this per process.

  3. You should probably also set path.logs individually for each Elasticsearch process.


#7

Thanks @pickypg, and @jasontedor. These are valuable suggestions. I tried to reuse the shards as @pickypig suggested, but wasn't successful. Given the number of changes in the config files, I am going to do an index reload and see how it performs.


(system) #8