Hi All,
We are running Elasticsearch 1.7.3 in our PERF environment, which consists of 3 physical servers, each with 256 GB RAM and 48 CPU cores. Each server runs 1 master node and 2 data nodes. We have an index, es_item, which holds nearly 1.2 TB of data (402,884,197 docs).
We indexed the docs through the bulk API using curl and did not see any errors during indexing. However, after we enabled replication with the number of replicas set to 2, we started seeing the following errors in the data node log files:
[2016-02-13 02:07:43,935][WARN ][indices.cluster] [tparhdetmi003_PERF_DATA2] [[es_item][6]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][6]: Recovery failed from [tparhdetmi003_PERF_DATA][wNpYExt_QdC-CK8YbzrYow][tparhdetmi003.enterprisenet.org][inet[/10.7.41.121:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 5 more
[2016-02-13 03:31:18,596][WARN ][indices.cluster ] [tparhdetmi003_PERF_DATA2] [[es_item][5]]marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhdetmi005.enterprisenet.org][inet[/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 5 more
[2016-02-13 05:56:04,367][WARN ][indices.cluster ] [tparhdetmi003_PERF_DATA2] [[es_item][5]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhdetmi005.enterprisenet.org][inet[/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w][tparhdetmi003.enterprisenet.org][inet[tparhdetmi003.enterprisenet.org/10.7.41.121:9263]]{max_local_storage_nodes=1, master=false} (no activity after [30m])
We are also getting the following error in the master log file:
[2016-02-13 06:12:32,134][WARN ][cluster.action.shard] [tparhdetmi005_PERF_MASTER] [es_item][7] received shard failed for [es_item][7], node[T0FrXJOJSsGi9kEof_jZrg], [R], s[INITIALIZING], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-02-13T09:42:35.259Z], details[shard failure [failed recovery][RecoveryFailedException[[es_item][7]: Recovery failed from [tparhdetmi004_PERF_DATA][qEL8s2ELS4CC_qMW_T_dAw][tparhdetmi004.enterprisenet.org][inet[/10.7.41.123:9260]]{max_local_storage_nodes=1, master=false} into [tparhetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhebfmi005.enterprisenet.org][inet[tparhebfmi005.enterprisenet.org/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]]], indexUUID [SaBb1lYzTp2t_rlbzpEWWQ], reason [shard failure [failed recovery][RecoveryFailedException[[ogrds_item2][7]: Recovery failed from [tparhetmi004_PERF_DATA][qEL8s2ELS4CC_qMW_T_dAw][tparhetmi004.enterprisenet.org][inet[/10.7.41.123:9260]]{max_local_storage_nodes=1, master=false} into [tparhetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg][tparhetmi005.enterprisenet.org][inet[tparhetmi005.enterprisenet.org/10.7.41.124:9260]]{max_local_storage_nodes=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]]
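To make it easier to see which shards and node pairs are affected, the data node failures above can be tallied with a small script. This is only a rough sketch: the regex assumes the exception line format shown in the data node logs above, the function name summarize_recovery_failures is made up for illustration, and the sample lines are shortened copies of the log entries (IP addresses and node attributes trimmed).

```python
import re
from collections import Counter

# Assumed format: the "RecoveryFailedException: [index][shard]: Recovery failed
# from [source]... into [target]..." lines from the data node logs above.
RECOVERY_FAILED = re.compile(
    r"RecoveryFailedException: \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]: "
    r"Recovery failed from \[(?P<source>[^\]]+)\].*? into \[(?P<target>[^\]]+)\]"
)


def summarize_recovery_failures(log_text):
    """Count failed recoveries per (index, shard, source node, target node)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RECOVERY_FAILED.search(line)
        if match:
            key = (
                match["index"],
                int(match["shard"]),
                match["source"],
                match["target"],
            )
            counts[key] += 1
    return counts


# Shortened copies of the three data node exceptions quoted above.
SAMPLE = """\
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][6]: Recovery failed from [tparhdetmi003_PERF_DATA][wNpYExt_QdC-CK8YbzrYow] into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w] (no activity after [30m])
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg] into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w] (no activity after [30m])
org.elasticsearch.indices.recovery.RecoveryFailedException: [es_item][5]: Recovery failed from [tparhdetmi005_PERF_DATA][T0FrXJOJSsGi9kEof_jZrg] into [tparhdetmi003_PERF_DATA2][8XnD8M3rRqSz_3aQpF0U-w] (no activity after [30m])
"""

if __name__ == "__main__":
    for (index, shard, source, target), n in summarize_recovery_failures(SAMPLE).items():
        print(f"{index}[{shard}] {source} -> {target}: {n} failed recoveries")
```

In the data node excerpts above, every failed recovery targets tparhdetmi003_PERF_DATA2, and shard 5 fails twice from the same source node.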
Please help us resolve this error.
Let us know if you need any other details.
Thanks,
Ganeshbabu R