Index restore in pending task

Hello experts,

I have tried to restore a snapshot (taken with ES 1.4.4), about 488 GB in size, into ES 2.4.0 (a 5-node cluster with a 30 GB heap).
For the last 3 days it has been sitting in the queue with no progress:
```
{"tasks":[{"insert_order":55678,"priority":"NORMAL","source":"restore_snapshot[mybackup]","executing":true,"time_in_queue_millis":262369377,"time_in_queue":"3d"}]}
```
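
For reference, the output above comes from the pending tasks API; a minimal way to keep monitoring it (commands assume the default localhost:9200, adjust host/port as needed):

```
# Pending cluster-state tasks (the stuck restore task shows up here)
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

# Per-shard restore/recovery progress, if any has started
curl -s 'localhost:9200/_cat/recovery?v'
```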

Can anyone help here?

Yikes!

I've seen similar things when the index had allocation filtering on it. Though I don't think the behavior is exactly the same. Either way, it is worth checking first.
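
A quick way to rule that out on the source 1.4.4 cluster (hypothetical index name; default host/port assumed):

```
# Any index.routing.allocation.* entries here would indicate allocation filtering
curl -s 'localhost:9200/my_index/_settings?pretty'

# Cluster-level allocation settings/filters
curl -s 'localhost:9200/_cluster/settings?pretty'
```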

Can you post a snapshot of the running threads taken with jstack?
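
Roughly like this, run as the user that owns the Elasticsearch process (the PID below is illustrative):

```
# Find the Elasticsearch PID, then dump all thread stacks to a file
jps -l              # or: ps aux | grep -i elasticsearch
jstack 12345 > es-threads.txt
```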

Hi Nik,

It's very long, so please find some of the lines below. Please suggest an alternate way to migrate the data if you can think of one.

"elasticsearch[node-name-node-name][clusterService#updateTask][T#1]" #53 daemon prio=5 os_prio=0 tid=0x00007f85b51e6800 nid=0x979 runnable [0x00007f7d20f6d000]
java.lang.Thread.State: RUNNABLE
at java.lang.Object.<init>(Object.java:37)
at java.util.zip.Deflater.<init>(Deflater.java:168)
at org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.checkMappingsCompatibility(MetaDataIndexUpgradeService.java:302)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:116)
at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:228)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
- <0x00007f7dea543e40> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[node-name-node-name][[http_server_boss.default]][T#1]{New I/O server boss #34}" #52 daemon prio=5 os_prio=0 tid=0x00007f85b5169000 nid=0x978 runnable [0x00007f7d2106e000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007f7de9ef6b28> (a sun.nio.ch.Util$2)
- locked <0x00007f7de9ef6b10> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007f7dea301bb0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.jboss.netty.channel.socket.nio.NioServerBoss.select(NioServerBoss.java:163)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
- <0x00007f7dea531250> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```

https://gist.github.com/

This is the important one:

"elasticsearch[node-name-node-name][clusterService#updateTask][T#1]" #53 daemon prio=5 os_prio=0 tid=0x00007f85b51e6800 nid=0x979 runnable [0x00007f7d20f6d000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Object.<init>(Object.java:37)
        at java.util.zip.Deflater.<init>(Deflater.java:168)
        at org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
        at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
        at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
        at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
        - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
        - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
        at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.checkMappingsCompatibility(MetaDataIndexUpgradeService.java:302)
        at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:116)
        at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:228)
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

It looks like it is attempting to restore the index. I don't know this code super well, but unless you had told me it wasn't working, I'd assume from the stack trace that it was. Does that thread change as time goes on?

I'm not really sure what is up, frankly. Are you seeing long GC pauses or something?
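
One way to check, if it helps (default host/port assumed): per-node GC counts and times are reported under jvm.gc.collectors in the node stats.

```
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
```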

Yes, I can see that ip:9200/_nodes/hot_threads changes as time goes on. But last time it took 9 days and nothing happened. This time I have enabled shard recovery debug logging and started the restore again, but nothing shows up in the log; it just keeps going.
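
For reference, this is roughly how recovery logging can be raised dynamically (the exact logger name may need adjusting):

```
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.indices.recovery": "DEBUG" }
}'
```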

For a rolling upgrade, I tried copying the index folder from the 1.4.4 node (/etc/es/clustername/nodes/0/indices/) into the target 2.4.0 location under a different name and restarting the node, but only the process started. The node didn't join the cluster, and I can see the same jstack trace there as well.

I have also tried to re-index using Logstash's elasticsearch input and output plugins. It works fine for all the other indexes, but this index (488 GB) takes too much time.

As of now I have no GC logs; let me enable GC logging and collect them. Please suggest a better, tested approach if you have one for migrating this amount of data.

@Srivastavark, where are these nodes running? Locally or in AWS? Which repository are you using? Did you change any throttling settings? Do you see any progress with a recovery monitoring tool such as _cat/recovery? When you run jstack several times, does the stack trace that @nik9000 posted disappear, or does it stay there?
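
For reference, roughly the checks I have in mind (default host/port assumed):

```
# Shard-level restore/recovery progress
curl -s 'localhost:9200/_cat/recovery?v'

# What the cluster-state update thread is doing right now
curl -s 'localhost:9200/_nodes/hot_threads'
```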

Hi Igor,

The nodes are running on AWS. We took the backup from 1.4.4 to S3, then copied it onto all nodes of the 2.4.0 cluster (also running on AWS) in the local filesystem, and we are restoring from that fs repository.
Please find the settings below.

{"persistent":{"threadpool":{"bulk":{"size":"16"},"index":{"size":"32"}},"indices":{"recovery":{"concurrent_streams":"5","max_bytes_per_sec":"500mb"}}},"transient":{"cluster":{"routing":
{"allocation":{"enable":"all"}}}}}

Nothing shows up in _cat/recovery for the index, but _nodes/hot_threads changes on each hit.

Cluster details:
3 master-data nodes, 2 data nodes.
All master-data and data nodes have 64 GB RAM and 1.29 TB of disk space, except one master-data node, which has 32 GB RAM and 660 GB of space.
Heap is set to no more than half of the available memory.

YML config - all defaults except the following:

```
cluster.name: myname
node.name: prodnode1
node.master: true
node.data: true
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
path.repo: /data/datas3/elastic-search-backup
bootstrap.memory_lock: true
network.host: hostIP
discovery.zen.ping.unicast.hosts: ["master-data", "master-data", "master-data"]
discovery.zen.minimum_master_nodes: 2
```

Could you post a few examples of hot threads? It's important to understand how they are changing.

Will add jstack.

It seems to be stuck in the following place on the master node:

       ...
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:381)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
       org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.checkMappingsCompatibility(MetaDataIndexUpgradeService.java:302)
       org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:116)
       org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:228)
       org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
       org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)

Could you run hot threads a few more times to see if it will get out of this loop? How many indices are you trying to restore from this snapshot?

Only one index; the snapshot size is 248 GB.

I can see these two frames on each _nodes/hot_threads request:

org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)

We have more than 150,000 document types in our mapping.
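
For reference, a rough way to count the types from the mapping output (hypothetical index name; assumes the source 1.4.4 cluster is reachable):

```
curl -s 'localhost:9200/my_index/_mapping' \
  | python -c 'import json,sys; m=json.load(sys.stdin); print(sum(len(v["mappings"]) for v in m.values()))'
```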

Jstack, 1st hit:

   at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
        at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
        - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
        - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)

Jstack, 2nd hit:

 at org.elasticsearch.index.mapper.object.ObjectMapper.toXContent(ObjectMapper.java:609)
    at org.elasticsearch.index.mapper.Mapping.toXContent(Mapping.java:141)
    at org.elasticsearch.index.mapper.DocumentMapper.toXContent(DocumentMapper.java:394)
    at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:88)
    at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
    at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
    - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
    - locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)

Judging from the stack traces you provided, it looks like the problem is in MetaDataIndexUpgradeService, which is trying to upgrade your index mapping from 1.4 to 2.4 and either goes into an infinite loop or simply takes a very, very long time for some reason. I briefly reviewed the code and I don't see any place that could cause an infinite loop, yet something in the mapping of your index is causing this issue.

I couldn't find any related issues, but it might be worth trying the restore on 2.4.4 to see whether this issue was fixed there. If that also fails, I can try reproducing the issue locally, but for that I would need the following three files from your repository:

```
ROOT
|- snapshot-NAME_OF_THE_SNAPSHOT
|- meta-NAME_OF_THE_SNAPSHOT
|- indices/
   | - INDEX_NAME
       |- meta-NAME_OF_THE_SNAPSHOT
```

Here NAME_OF_THE_SNAPSHOT is the name of the snapshot you are trying to restore and INDEX_NAME is the name of the index in this snapshot. These files contain your cluster's persistent settings and templates, and the index mappings and settings. If you can share this information, please send it to me via a private message or email (my email is my first name at elastic.co).
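
If it is easier, those three files can be bundled from the repository root roughly like this (placeholder names; substitute your actual snapshot and index names):

```
cd /data/datas3/elastic-search-backup    # the path.repo location from the config above
tar czf snapshot-metadata.tgz \
    snapshot-NAME_OF_THE_SNAPSHOT \
    meta-NAME_OF_THE_SNAPSHOT \
    indices/INDEX_NAME/meta-NAME_OF_THE_SNAPSHOT
```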

Many thanks, Igor. We have 2.4.0 in production, but I will try to validate on 2.4.4 as well and will update here. I have easily migrated all the other indexes using Logstash and snapshot/restore.
