ES marking and sending shard failed due to [failed recovery] when enabling replication

Hi All,

We are using Elasticsearch version 1.7.3 in our PERF environment. In this environment we have 3 physical servers, each well provisioned with 256 GB RAM and 48 CPU cores. We have an index es_item containing nearly 1.2 TB of data with a document count of 402,884,197. Each server runs 1 master and 2 data nodes.
We indexed the docs through the curl bulk API and didn't face any errors during indexing, but once we set the number of replicas to "2" the replica recoveries started failing.
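For reference, the replica count was enabled with an index settings update roughly like the following (a sketch; the exact command we ran may have differed):

curl -XPUT 'localhost:9200/es_item/_settings' -d '{
  "index" : {
    "number_of_replicas" : 2
  }
}'

After that, we saw the errors below in the data node log files: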

[2016-02-13 02:07:43,935][WARN ][indices.cluster] [tparhdetmi003_PERF_DATA2] [[es_item][6]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][6]: Recovery failed from [tparhdetmi003_PERF_DATA][wNpYExt_QdC-CK8YbzrYow][tparhdetmi003.enterprisenet.org][inet[/10.7.41.121:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 5 more
[2016-02-13 03:31:18,596][WARN ][indices.cluster ] [tparhdetmi003_PERF_DATA2] [[es_item][5]]marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhdetmi005.enterprisenet.org][inet[/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 5 more
[2016-02-13 05:56:04,367][WARN ][indices.cluster ] [tparhdetmi003_PERF_DATA2] [[es_item][5]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhdetmi005.enterprisenet.org][inet[/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])

We are also getting this error in the master log file:

[2016-02-13 06:12:32,134][WARN ][cluster.action.shard] [tparhdetmi005_PERF_MASTER] [es_item][7] received shard failed for [es_item][7], node[T0FrXJOJSsGi9kEof_jZrg], [R], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-02-13T09:42:35.259Z], details[shard failure [failed recovery][RecoveryFailedException[[es_item][7]: Recovery failed from [tparhdetmi004_PERF_DATA][qEL8s2ELS4CC_qMW_T_dAw][tparhdetmi004.enterprisenet.org][inet[/10.7.41.123:9260]]{max_local_storage_nodes=1, master=false} into [tparhetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhebfmi005.enterprisenet.org][inet[tparhebfmi005.enterprisenet.org/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]]], indexUUID [SaBb1lYzTp2t_rlbzpEWWQ], reason [shard failure [failed recovery][RecoveryFailedException[[ogrds_item2][7]: Recovery failed from [tparhetmi004_PERF_DATA][qEL8s2ELS4CC_qMW_T_dAw][tparhetmi004.enterprisenet.org][inet[/10.7.41.123:9260]]{max_local_storage_nodes=1, master=false} into [tparhetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhetmi005.enterprisenet.org][inet[tparhetmi005.enterprisenet.org/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]]

Please help us solve this error.

Let us know if you need any other details.

Thanks,
Ganeshbabu R

What version of Elasticsearch are you running?

The "no activity after [30m]" message suggests connectivity issues. Did you saturate the network bandwidth, or did any of the machines go down?

Also, is there enough disk space on the node? (i.e. the tparhdetmi003.enterprisenet.org/10.7.41.121:9263 node)

Hi @abeyad

We are using ES version 1.7.3. I am not sure whether any network connectivity issue occurred; I will check with the network team and let you know the details.
No machines went down, and during that time the cluster was showing a green state.
Yes, there is enough disk space available for all the nodes.

Thanks,
Ganeshbabu R

Hi @abeyad

Below is the response from the network team:

The only error that has been logged is for an NTP server issue.

Feb 13 11:39:01 tparhdetmi003 ntpd_intres[4187]: host name not found: ntps2-pat-cbts7.companymedia.com

Please let us know your suggestions.

Thanks,
Ganeshbabu R

I'm facing the same issue with version 2.1.0. After the cluster came back up, reallocation failed for a particular shard and the reason is "no activity after [30m]". What does this mean? How do I recover this shard? Any help is appreciated.

That error indicates that during the shard recovery process, the node receiving a copy of the shard hasn't received any data for 30 minutes. Ganeshbabu, your setup strikes me as curious - do you have only 1 master-eligible node? If so, that is dangerous: if the master goes down or is unresponsive for any period of time, your cluster will no longer be operational.
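As an aside, if you run multiple master-eligible nodes, make sure discovery.zen.minimum_master_nodes is set to a quorum to avoid split brain. A sketch of the relevant elasticsearch.yml lines, assuming 3 master-eligible nodes:

# elasticsearch.yml (sketch, assuming 3 master-eligible nodes)
node.master: true
discovery.zen.minimum_master_nodes: 2   # quorum = (master-eligible nodes / 2) + 1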

The disks might be filling up on the node that is on the receiving end of the replica shard copy. Can you verify disk usage?
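A quick way to check per-node disk usage from Elasticsearch itself (assuming the HTTP API is reachable on localhost:9200):

curl -XGET 'localhost:9200/_cat/allocation?v'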

If it's not the disks filling up, it might help to tune the shard recovery throttling by changing the indices.recovery.max_bytes_per_sec and indices.recovery.concurrent_streams settings:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "100mb",
    "indices.recovery.concurrent_streams" : "10"
  }
}'
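You can check that the transient settings took effect with:

curl -XGET 'localhost:9200/_cluster/settings?pretty'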

Also, it would be useful to see the output of the hot threads API, to check whether any issues are apparent: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
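For example (assuming the default HTTP port):

curl -XGET 'localhost:9200/_nodes/hot_threads'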

It would also be useful to see each of your cluster states.
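For example (a sketch):

curl -XGET 'localhost:9200/_cluster/state?pretty'
curl -XGET 'localhost:9200/_cluster/health?pretty'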

Lastly, looking at Ganeshbabu's logs, it seems you have 2 different nodes running on the same host (tparhdetmi003.enterprisenet.org hosts both tparhdetmi003_PERF_DATA and tparhdetmi003_PERF_DATA2).

Thanks for your response @abeyad

Basically we have 3 physical servers; below are the node setup details:
Server name -- Node Setup
tparhdetmi003 -- Master, Data, Data2, Client
tparhdetmi004 -- Master, Data, Data2
tparhdetmi005 -- Master, Data, Data2

Below is the disk usage of the server (tparhdetmi003):

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sysvg-root.vol 3.9G 488M 3.2G 14% /
tmpfs 127G 0 127G 0% /dev/shm
/dev/sda1 976M 50M 875M 6% /boot
/dev/mapper/sysvg-home.vol 2.0G 558M 1.3G 31% /home
/dev/mapper/sysvg-opt.vol 20G 5.4G 14G 29% /opt
/dev/mapper/sysvg-tmp.vol 20G 45M 19G 1% /tmp
/dev/mapper/sysvg-usr.vol 4.9G 2.6G 2.1G 57% /usr
/dev/mapper/sysvg-var.vol 197G 1.3G 186G 1% /var
/dev/sdb 1.1T 362G 682G 35% /app/data1
/dev/sdc 1.1T 270G 774G 26% /app/data2
/dev/sdd 1.1T 278G 766G 27% /app/data3
/dev/sde 1.1T 426G 618G 41% /app/data4
/dev/sdf 1.1T 439G 605G 43% /app/data5
/dev/sdg 1.1T 443G 601G 43% /app/data6
/dev/sdh 1.1T 71M 1.1T 1% /app/data7
/dev/sdi 1.1T 276G 769G 27% /app/data8
/dev/sdj 1.1T 71M 1.1T 1% /app/data9
/dev/sdk 1.1T 72M 1.1T 1% /app/data10
/dev/sdl 1.1T 72M 1.1T 1% /app/data11
/dev/sdm 1.1T 72M 1.1T 1% /app/data12
/dev/sdn 1.1T 331M 1.1T 1% /app/data13
/dev/sdo 1.1T 336M 1.1T 1% /app/data14
/dev/sdp 1.1T 340M 1.1T 1% /app/data15
/dev/sdq 1.1T 71M 1.1T 1% /app/data16
/dev/sdr 1.1T 71M 1.1T 1% /app/data17
/dev/sds 1.1T 71M 1.1T 1% /app/data18
/dev/sdt 1.1T 71M 1.1T 1% /app/data19
/dev/sdu 1.1T 71M 1.1T 1% /app/data20
/dev/sdv 1.1T 71M 1.1T 1% /app/data21
/dev/sdw 1.1T 71M 1.1T 1% /app/data22
/dev/mapper/sysvg-opt--app.vol 99G 2.7G 91G 3% /opt/app
cm_processes 127G 72M 127G 1% /var/run/cloudera-scm-agent/process

I will definitely try the shard recovery throttling settings and the hot threads API during our replication testing.

Regards,
Ganeshbabu R