How to improve recovery speed?

javadevmtl · June 4, 2015, 6:45pm

Using Elasticsearch 1.5.2.

I have 16 shards initializing.

I'm checking the status of recovery through _cat/recovery, but I only see the percentage of one shard at a time actually moving up, and very slowly. It looks like it's recovering one shard at a time.
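To make "how many shards are actually moving" easier to see, the text output of _cat/recovery?v can be filtered down to the shards that are not finished yet. The following is a minimal sketch, not an official tool: it assumes the output has a header row that includes a `stage` column and that finished recoveries report the stage `done`. The sample rows are hypothetical and trimmed to a few columns.

```python
def active_recoveries(cat_output):
    """Return one dict per shard whose recovery stage is not 'done'.

    Columns are looked up by header name, so the column order does not matter.
    """
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = [dict(zip(header, ln.split())) for ln in lines[1:]]
    return [r for r in rows if r.get("stage") != "done"]


# Hypothetical sample output, trimmed to a few columns for illustration.
sample = """\
index shard time type    stage bytes_percent
logs  0     1200 replica done  100.0%
logs  1     5400 replica index 12.3%
logs  2     0    replica init  0.0%
"""

for row in active_recoveries(sample):
    print(row["index"], row["shard"], row["stage"], row["bytes_percent"])
```

Running this against two snapshots taken a minute apart and comparing `bytes_percent` per shard shows whether more than one shard is progressing at the same time.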

I have 4 nodes.
Per node: 32 cores, ES_HEAP_SIZE = 30gb, and SanDisk Extreme Pro 960GB SSDs in RAID 0.

My settings are...

{
   "persistent": {},
   "transient": {
      "cluster": {
         "routing": {
            "allocation": {
               "cluster_concurrent_rebalance": "4",
               "node_concurrent_recoveries": "4",
               "enable": "all"
            }
         }
      },
      "threadpool": {
         "bulk": {
            "size": "56",
            "queue_size": "56"
         },
         "search": {
            "size": "100"
         }
      },
      "indices": {
         "store": {
            "throttle": {
               "max_bytes_per_sec": "200mb"
            }
         },
         "recovery": {
            "translog_size": "512kb",
            "translog_ops": "1000",
            "max_bytes_per_sec": "40mb",
            "file_chunk_size": "512kb"
         }
      }
   }
}

Here are my HD stats: http://tinypic.com/r/5eu1bb/8

What can I tweak?

otisg · June 4, 2015, 8:00pm

Something like this:

curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : "5"
  }
}'

You can also raise the max bytes per second limit and increase the number of concurrent streams used during recovery, so that recovery runs faster:

curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.recovery.max_bytes_per_sec": "200mb",
    "indices.recovery.concurrent_streams": 5
  }
}'

Otis

Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

javadevmtl · June 4, 2015, 8:10pm

Cool, I applied those settings. But do they take effect on the current recovery, or do the nodes need to be restarted first so that only the next recovery uses the new settings?

Right now I still only see the percentage going up for 1-2 shards, not more...
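One way to at least confirm that the cluster accepted the new values is to read them back from the same _cluster/settings endpoint used above. The sketch below is illustrative only: the helper names (`flatten`, `fetch_cluster_settings`, `recovery_settings`) are made up for this example, and the default URL assumes a node listening on localhost:9200. It flattens the nested form shown in the first post into the dotted form used in the curl commands, so both can be compared directly. It does not show whether a recovery that is already running picks up the change.

```python
import json
from urllib.request import urlopen


def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. indices.recovery.max_bytes_per_sec."""
    flat = {}
    for key, value in settings.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def fetch_cluster_settings(base_url="http://localhost:9200"):
    """GET /_cluster/settings from a node (base_url is an assumption)."""
    with urlopen(base_url + "/_cluster/settings") as resp:
        return json.loads(resp.read().decode("utf-8"))


def recovery_settings(settings_doc):
    """Keep only recovery and allocation settings, tagged by scope."""
    wanted = ("indices.recovery.", "cluster.routing.allocation.")
    result = {}
    for scope in ("persistent", "transient"):
        for key, value in flatten(settings_doc.get(scope, {})).items():
            if key.startswith(wanted):
                result[scope + ":" + key] = value
    return result
```

Usage against a live node would be `recovery_settings(fetch_cluster_settings())`. Because the commands above write to "persistent" while the original settings were "transient", the same key can appear under both scopes with different values, which is worth checking.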

javadevmtl · June 5, 2015, 2:10pm

So those settings didn't seem to make a difference; it took a whole day to recover.

1- I know that on a regular rolling restart, where we disable and re-enable cluster.routing.allocation, the shards come back almost right away. I guess that's because they are loaded from local disk.

2- If I just power off a node at random to simulate a "crash", recovery takes forever. I only see about 50% network utilisation, the disks don't seem to be doing much IO, and recovery slowly limps along until it's done (try 16 hours). Yet if I grab one of the big index files from the ES data folder on one node and manually copy it to a TEMP folder on another node, I can push the network usage to 100%. A 5GB file takes about 20 seconds to copy.
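For scale, the numbers in this thread can be put side by side. This is only back-of-the-envelope arithmetic under stated assumptions: "5GB" is taken as 5 * 1024 MB, and indices.recovery.max_bytes_per_sec is modelled as a plain cap on megabytes per second, ignoring concurrency and any other overhead.

```python
def transfer_seconds(size_mb, rate_mb_per_s):
    """Time to move size_mb at a constant rate, ignoring all overhead."""
    return size_mb / rate_mb_per_s


FILE_MB = 5 * 1024             # the 5GB index file from the manual copy test
OBSERVED_RATE = FILE_MB / 20   # MB/s achieved by the manual copy (20 seconds)

print("manual copy rate: %.0f MB/s" % OBSERVED_RATE)
for label, cap in [("original 40mb cap", 40), ("suggested 200mb cap", 200)]:
    print("%s: %.1f s for the same file" % (label, transfer_seconds(FILE_MB, cap)))
```

Under this simple model the manual copy runs at about 256 MB/s, the same file would take 128 seconds at the original 40mb cap, and about 26 seconds at the suggested 200mb cap.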

Any other thoughts?

KlavsKlavsen · November 9, 2016, 2:58pm

I have the same issue.
curl -s -XGET 'localhost:9200/_cat/recovery?v' shows only 1 shard at a time increasing in percentage, and the host has PCIe SSDs with neither a big IO load nor a big CPU load (30 cores).
