Question about "FileNotFoundException(no segments* file found)" while allocating replica shard

Hyunsoo_Shim · April 3, 2020, 12:27am

Hello, we ran into an issue when allocating replicas after creating the primary shard.

[System ENV]

ES version : 6.5.2
The cluster has 6 nodes in total (one node per instance)

[Scenario]
First, we created an index and inserted about 600,000 docs through the Bulk API.
The result is below.

index shard number : 1
index replica number : 0
index size : 1.9GB (there is only 1 primary shard)
translog operations : 99541
translog size : 537M
all segments are committed

Then we set the replica number to "5" so that every node would hold a copy of the shard.
The master node started allocating replica shards to each node.
FYI, we didn't change the "cluster.routing.allocation.node_concurrent_incoming_recoveries" / "node_concurrent_outgoing_recoveries" values.
That means we used the default value (2) for concurrent shard recoveries.
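
As a rough illustration of what a limit of 2 implies here: with the single primary as the only recovery source, 5 new replicas would be expected to recover in waves rather than all at once. The sketch below is a simplified model, not Elasticsearch code. `recovery_waves` is a hypothetical helper, and it assumes every recovery in a wave finishes before the next wave starts.

```python
def recovery_waves(replica_count, concurrent_limit):
    """Group replica recoveries into waves of at most `concurrent_limit`.

    Simplified model (an assumption, not Elasticsearch's scheduler): all
    recoveries copy from the single primary, and a wave completes before
    the next one begins.
    """
    if concurrent_limit < 1:
        raise ValueError("concurrent_limit must be at least 1")
    replicas = list(range(1, replica_count + 1))
    return [replicas[i:i + concurrent_limit]
            for i in range(0, replica_count, concurrent_limit)]


waves = recovery_waves(replica_count=5, concurrent_limit=2)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: replicas {wave}")
```

Under this model, 5 replicas with a limit of 2 recover as 2, then 2, then 1.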

Question 1
In Kibana I saw the Translog stage run after the Index stage, and all translog operations (99541) appeared to be performed.
As I mentioned, all segments were already committed on the primary shard, so I wonder why the translog operations are performed again for the replicas.
(I guess those operations are for verifying the allocated replica shards.)
Also, I don't know why "uncommitted_operations" was non-zero, as shown below, after all replica recoveries were done.

Primary Translog
  "translog" : {
    "operations" : 99541,
    "size_in_bytes" : 560970997,
    "uncommitted_operations" : 0,
    "uncommitted_size_in_bytes" : 55,
    "earliest_last_modified_age" : 0
  },

Total Translog
  "translog" : {
    "operations" : 597246,
    "size_in_bytes" : 3365768232,
    "uncommitted_operations" : 18725,
    "uncommitted_size_in_bytes" : 118041690,
    "earliest_last_modified_age" : 0
  },
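
For reference, the two blocks above can be compared arithmetically. This is a minimal sketch, assuming "Total" aggregates the primary plus the 5 replica copies; the variable names are only for illustration.

```python
# Values copied from the two translog stats blocks above.
primary = {
    "operations": 99541,
    "size_in_bytes": 560970997,
    "uncommitted_operations": 0,
}
total = {
    "operations": 597246,
    "size_in_bytes": 3365768232,
    "uncommitted_operations": 18725,
}

# Assumption: "total" sums the primary and all 5 replica copies.
replica_copies = 5

replica_ops = total["operations"] - primary["operations"]
replica_uncommitted = (total["uncommitted_operations"]
                       - primary["uncommitted_operations"])

print("operations held by replicas:", replica_ops)
print("same count on every copy:",
      replica_ops == replica_copies * primary["operations"])
print("uncommitted operations on replicas:", replica_uncommitted)
```

The operation total is exactly 6 x 99541, so each copy appears to hold the full translog, while all 18725 uncommitted operations are on the replicas.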

Question 2
I noticed that the segment file names on the destination node start with a "recovery.xxxx" prefix, and that the prefix is removed after replica allocation.
(e.g., "recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfe" -> "_0.cfe")
When does this rename happen? (After the Index stage or after the Translog stage?)
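
The naming pattern in the example can be expressed as follows. This is only a sketch of the pattern seen in the file names (`recovery.<id>.<original name>`). `strip_recovery_prefix` is a hypothetical helper, not an Elasticsearch function, and the regex assumes the id contains no dots.

```python
import re

# Pattern observed in the file names: "recovery.<id>.<original name>".
# Assumption: the id part never contains a dot.
RECOVERY_PREFIX = re.compile(r"^recovery\.[^.]+\.")


def strip_recovery_prefix(file_name):
    """Return the file name without the temporary recovery prefix, if any."""
    return RECOVERY_PREFIX.sub("", file_name, count=1)


print(strip_recovery_prefix("recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfe"))  # _0.cfe
print(strip_recovery_prefix("write.lock"))  # write.lock
```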

Question 3
I saw a large number of warning logs during the Translog stage.
The key parts of the logs are below.

WARN message : o.e.g.G.InternalReplicaShardAllocator][node_name][index_name][0]: failed to list shard for shard_store on node
...
Caused by: java.io.FileNotFoundException: no segments* file found in store...
files: [
recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfe,
recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfs,
recovery.4MCUjTEbReWPnfSUX3gsGA._0.si,
recovery.4MCUjTEbReWPnfSUX3gsGA._0_1h.liv,
recovery.4MCUjTEbReWPnfSUX3gsGA._1.cfe,
....
recovery.4MCUjTEbReWPnfSUX3gsGA._y.cfs,
recovery.4MCUjTEbReWPnfSUX3gsGA._y.si,
recovery.4MCUjTEbReWPnfSUX3gsGA._y_s.liv,
recovery.4MCUjTEbReWPnfSUX3gsGA.segments_64,
write.lock
]
...

Could you let me know what caused those logs?
(Sometimes even the master node stops working for a few minutes after those warning logs.)
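
One way to read the file list in the exception: every segment file, including `segments_64`, still carries the temporary `recovery.` prefix, so nothing in the directory has a name beginning with `segments`. The sketch below checks that, under the assumption (not confirmed in this thread) that a store counts as having a commit point only when some file name starts with `segments_`. `has_commit_point` is a hypothetical helper, and the list holds only the names visible in the log.

```python
# File names visible in the warning above (the elided "...." entries are omitted).
files = [
    "recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfe",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._0.cfs",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._0.si",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._0_1h.liv",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._1.cfe",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._y.cfs",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._y.si",
    "recovery.4MCUjTEbReWPnfSUX3gsGA._y_s.liv",
    "recovery.4MCUjTEbReWPnfSUX3gsGA.segments_64",
    "write.lock",
]


def has_commit_point(file_names):
    """Assumed check: a commit point exists if a name starts with 'segments_'."""
    return any(name.startswith("segments_") for name in file_names)


print(has_commit_point(files))                    # False
print(has_commit_point(files + ["segments_64"]))  # True
```

If that assumption holds, a store listing taken while the files are still prefixed would report no segments file, which is consistent with the message in the log.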

One more interesting thing is that the above warnings do not occur when we set the replica number to 1 instead of 5.
I also did not see this issue when I increased the replica number one step at a time
(i.e., set replica 1 -> finish -> set replica 2 -> finish ... -> set replica 5 -> finish).

I guess there may be some synchronization issue when many replica shards are allocated at once.
Could you help me?

Thank you!

Hyunsoo_Shim · April 3, 2020, 7:49am

I need to correct one thing about "Question 3".
The "FileNotFoundException(no segments* file found)" issue did not occur during the Translog stage.
As I mentioned, we set the replica number to 5, and the ES engine started by allocating 2 replicas first, since recovery concurrency was at its default value (2).
The issue seemed to occur when recovery of the 3rd replica started, immediately after recovery of the first 2 replicas completed.

system · May 1, 2020, 7:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
