Is it me or is ES 1.6.0 node startup/recovery slower than before?

On the tin it says it's supposed to be faster.

I follow these steps (the settings calls are sketched as curl below)...

1- transient.cluster.routing.allocation.enable: none
2- shutdown the specific node
3- Do whatever updates (in this case Windows updates)
4- Reboot the server
5- Start ES on that node
6- transient.cluster.routing.allocation.enable: all
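
For steps 1 and 6 I'm just hitting the cluster settings API, roughly like this (xxxxxx being the node hostname):

curl -XPUT 'http://xxxxxx:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

...then once the node is back up:

curl -XPUT 'http://xxxxxx:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'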

In 1.5.2 the shards would be reallocated within a minute at most. Now it seems to be transferring shards over from another node rather than rebuilding from the shard copy already on the local node.

There are no pending tasks: GET _cluster/pending_tasks
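
Checked roughly like this; the tasks list comes back empty:

curl 'http://xxxxxx:9200/_cluster/pending_tasks?pretty'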

These are my current settings...
{
  "persistent": {
    "indices": {
      "store": {
        "throttle": {
          "max_bytes_per_sec": "200mb"
        }
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "6",
        "translog_size": "1024kb",
        "translog_ops": "2000",
        "concurrent_small_file_streams": "4",
        "max_bytes_per_sec": "200mb",
        "file_chunk_size": "1024kb"
      }
    }
  }
}
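
That's the output of the cluster settings API, pulled with something like:

curl -XGET 'http://xxxxxx:9200/_cluster/settings?pretty'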

This is the last cat....

curl http://xxxxxx:9200/_cat/recovery | grep index
myindex 3 516194 relocation index Node02 Node01 n/a n/a 259 66.0% 174063687156 11.4% 259 174063687156 0 100.0% 0
myindex 4 516176 relocation index Node03 Node01 n/a n/a 259 64.5% 174074979560 11.9% 259 174074979560 0 100.0% 0

Unless I did something wrong in the steps described above, this is really slow. It's been 3 hours and it's not even halfway through recovering the shards. No bulking, no searching, just waiting patiently :slight_smile:
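
Side note in case the raw output above is hard to follow: the _cat APIs take a v parameter that prints the column headers, e.g.:

curl 'http://xxxxxx:9200/_cat/recovery?v'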

Try doing a synced flush before the restart.

I found that this made recovery almost instant.
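
If I remember the endpoint right (synced flush was added in 1.6), it's something like this, with the host adjusted to one of your nodes:

curl -XPOST 'http://localhost:9200/_all/_flush/synced'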

Will try it on the next node... When this one is done...

Also can you give some info on the size of your indices and what kind of hardware you have behind them?

4 nodes.
Each node is: 32 cores, 128GB RAM, ES_HEAP=30GB, and SSDs in RAID 0 on 6Gb/s SAS

6 indexes of 8 shards + 1 replica; each shard is about 140GB.
1.3 billion documents across the 6 indexes.

I'm pretty sure it was pretty speedy in 1.5.2: a minute at most to re-assign the shards after a restart. I'm not talking about recovery from a failed node but a regular rolling restart following the steps described above, though without the sync step.

Btw still recovering...

As you can see, all the shards are coming locally off the same node; there's no relocation, so it should be pretty fast, right?
Except for the first index, which surprised me by relocating its shards.

You're right, it "should" be fast, but for some reason Elasticsearch seems to do this when it finds inconsistencies between the primary shard and the replica shard. I am not sure exactly what triggers it, but if ES finds one of these inconsistencies it will replicate the existing shard over to the node that has been restarted instead of simply allocating the old shard.

I think what ES looks for is inconsistencies between the translog and the segments in each shard. Likely this means you restarted the node while it still had a full translog and Elasticsearch didn't like that. Hopefully doing a synced flush will prevent this in the future.

With that said, I am surprised it is taking over 3 hours to reallocate the shards, though your indices are quite large.

1Gb/s network; the node is reporting 50% network usage, though I have copied files from node to node just to test it, and that hit 100% usage and was quite fast.

I'll wait and test synced flush, but I swear it was faster before. The only main difference is that I had 5 indexes instead of 6.

Without synced flush the recovery process is pretty similar to rsync. The trouble with that is that the primary and replica shards drift from each other pretty substantially. So time to green is mostly a function of how many index and update operations you've done since the last round of restarts.

Check the throttle on the recovery - it defaults to something pretty slow IIRC. It's a reasonably sensible default if you want to make sure that recovery doesn't impact search performance.
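
If you want to bump it, it's just another cluster settings call, roughly like this (the 200mb value is only an example, tune to taste):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.max_bytes_per_sec": "200mb" }
}'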

Ok so I just tried...

1- cluster.routing.allocation.enable to none (clicked the kopf lock)
2- POST /_all/_flush/ (curl version below)
3- Shut down through kopf
4- Restart
5- Re-enable allocation
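
Step 2 as a curl call, for reference (xxxxxx being the node):

curl -XPOST 'http://xxxxxx:9200/_all/_flush/'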

It decided to go pull the shards off another node. I haven't written to the indexes in over 15 hours, and now it will take a good 10 hours to recover...

The only indexes that update constantly are the Marvel indexes. But those are peanuts compared to my indexes.

Spoke too soon. All shards but 2 are recovered. It took at least an hour though. I still don't know why it picked two shards to relocate from other nodes; those are still recovering. Maybe the relocation blocked the other nodes? Like I said, in 1.5.2 all shards were recovered in under a minute. And this is with strictly no writes to the indexes at all.

This is an issue I have run into as well. Once shard allocation is enabled, it seems like it is essentially a race between the shards belonging to other nodes and the unallocated shards for which will claim the empty spot on the restarted node. Thus, if a shard from another node wins the race it will cause all kinds of problems with reallocating and rebalancing.

Seems like there should be a setting to give priority to unallocated shards over shards that already belong to another node.

Yes, this is about the 4th time I've tried the rolling update process. For some reason it decides to relocate 2 shards at the beginning; that goes on for a while (maybe about an hour), then the rest of the shards allocate fairly quickly, and now it's relocating 2 more shards to balance out the first two.

OK, so setting cluster.routing.allocation.node_concurrent_recoveries to 4 gave me a speedy recovery to green, a minute tops. But the fact that it still chose to relocate 2 shards still happens and eludes me, considering I have not written any new data to any indexes in the past 48 hours.
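
For reference, I set it through the cluster settings API, roughly:

curl -XPUT 'http://xxxxxx:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.node_concurrent_recoveries": 4 }
}'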

It might be a good idea to try writing a large amount of data and try restarting again as a test. I found that as long as I wasn't indexing new data, successive restarts would start to get faster and faster until they were almost instant. The real issue presents itself in the first few restarts after indexing a bunch of new data.