I have tried to restore a snapshot (taken with ES 1.4.4, about 488 GB) into ES 2.4.0 (5-node cluster, 30 GB heap). For the last 3 days it has been sitting in the queue with no progress:
```
{"tasks":[{"insert_order":55678,"priority":"NORMAL","source":"restore_snapshot[mybackup]","executing":true,"time_in_queue_millis":262369377,"time_in_queue":"3d"}]}
```
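For reference, the queue status above is the output of the pending cluster tasks API; it can be polled with something like this (assuming a node reachable on localhost:9200):
```
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'
```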
I've seen similar things when the index had allocation filtering on it, though I don't think the behavior is exactly the same. Either way, it is worth checking first.
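A quick way to check, assuming the node is reachable on localhost:9200 and using INDEX_NAME as a placeholder, is to look at the index and cluster settings for allocation filters:
```
# Index-level filters show up as index.routing.allocation.* settings
curl -s 'localhost:9200/INDEX_NAME/_settings?pretty'
# Cluster-level filters show up as cluster.routing.allocation.* settings
curl -s 'localhost:9200/_cluster/settings?pretty'
```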
Can you post a snapshot of the running threads taken with jstack?
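If it helps, a minimal sketch for capturing the dump, assuming the standard 2.x main class name and running as the user that owns the Elasticsearch process:
```
# Find the Elasticsearch PID and dump all thread stacks (including locked synchronizers)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack -l "$ES_PID" > /tmp/es-threads-$(date +%s).txt
```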
It's very long, so please find some lines below. Please also suggest an alternative approach if you think there is a better way to migrate the data.
"elasticsearch[node-name-node-name][clusterService#updateTask][T#1]" #53 daemon prio=5 os_prio=0 tid=0x00007f85b51e6800 nid=0x979 runnable [0x00007f7d20f6d000]
java.lang.Thread.State: RUNNABLE
at java.lang.Object.<init>(Object.java:37)
at java.util.zip.Deflater.<init>(Deflater.java:168)
at org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.checkMappingsCompatibility(MetaDataIndexUpgradeService.java:302)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:116)
at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:228)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00007f7dea543e40> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[node-name-node-name][[http_server_boss.default]][T#1]{New I/O server boss #34}" #52 daemon prio=5 os_prio=0 tid=0x00007f85b5169000 nid=0x978 runnable [0x00007f7d2106e000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007f7de9ef6b28> (a sun.nio.ch.Util$2)
- locked <0x00007f7de9ef6b10> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007f7dea301bb0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.jboss.netty.channel.socket.nio.NioServerBoss.select(NioServerBoss.java:163)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00007f7dea531250> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"elasticsearch[node-name-node-name][clusterService#updateTask][T#1]" #53 daemon prio=5 os_prio=0 tid=0x00007f85b51e6800 nid=0x979 runnable [0x00007f7d20f6d000]
java.lang.Thread.State: RUNNABLE
at java.lang.Object.<init>(Object.java:37)
at java.util.zip.Deflater.<init>(Deflater.java:168)
at org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.checkMappingsCompatibility(MetaDataIndexUpgradeService.java:302)
at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:116)
at org.elasticsearch.snapshots.RestoreService$1.execute(RestoreService.java:228)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It looks like it is attempting to restore the index. I don't know this code super well, but unless you had told me it wasn't working I'd assume from the stack trace that it was. Does that thread change as time goes on?
I'm not really sure what is up, frankly. Are you having GCs or something?
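One quick way to check for GC pressure, assuming the node is reachable on localhost:9200, is the JVM section of the node stats:
```
# Frequent or long old-generation collections show up under jvm.gc.collectors
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
```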
Yes, I can see the ip:9200/_nodes/hot_threads output changing as time goes on. But last time it ran for 9 days and nothing happened. This time I have enabled shard recovery debug logging and started the restore again, but there is nothing in the log; it just keeps going.
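For reference, the shard recovery debug logging mentioned here is typically enabled with a dynamic logger setting along these lines (2.x accepts logger settings through the cluster settings API):
```
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.indices.recovery": "DEBUG" }
}'
```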
For a rolling upgrade, I tried copying the index folder from the 1.4.4 data path (/etc/es/clustername/nodes/0/indices/) into the target 2.4.0 location under a different name and restarting the node, but only the process started; the node didn't join the cluster, and I can see the same jstack output there as well.
I have also tried re-indexing using Logstash's elasticsearch input and output plugins. It works fine for all other indices, but this index (488 GB) is taking too much time.
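One way to keep an eye on such a re-index, assuming the target cluster is on localhost:9200 and using INDEX_NAME as a placeholder, is to compare document counts periodically:
```
# docs.count on the target index should keep growing while the re-index makes progress
curl -s 'localhost:9200/_cat/indices/INDEX_NAME?v&h=index,docs.count,store.size'
```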
As of now there is no GC activity; let me enable GC logging and collect the logs. Please suggest a better tested approach to migrate this amount of data, if you have one.
@Srivastavark where are these nodes running? Locally or in AWS? Which repository are you using? Did you change any throttling settings? Do you see any progress when using a recovery monitoring tool such as _cat/recovery? When you run jstack several times, does the stack trace that @nik9000 posted disappear, or does it stay there?
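To make those checks concrete, assuming a node on localhost:9200 and that jstack targets the Elasticsearch process, something like:
```
# Ongoing recoveries/restores, if any, show up here
curl -s 'localhost:9200/_cat/recovery?v'
# Take a few thread dumps a minute apart to see whether the mapping-merge frames move
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
for i in 1 2 3; do jstack -l "$ES_PID" > "jstack_$i.txt"; sleep 60; done
```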
The nodes are running on AWS. We took the backup from 1.4.4 to S3, copied it onto the local filesystem of every node in the 2.4.0 cluster (also running on AWS), and we are restoring from an fs repository.
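For reference, a filesystem repository set up this way is typically registered and the restore started roughly as below. The repository name, path, and index name are placeholders (only the snapshot name mybackup comes from the task output above), and the location has to be whitelisted via path.repo in elasticsearch.yml on every node:
```
# Register the shared filesystem repository (placeholder name and path)
curl -XPUT 'localhost:9200/_snapshot/my_fs_repo' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backup" }
}'
# Restore only the problematic index from the snapshot
curl -XPOST 'localhost:9200/_snapshot/my_fs_repo/mybackup/_restore' -d '{
  "indices": "INDEX_NAME"
}'
```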
Please find the settings below.
There is nothing in _cat/recovery for the index, but the _nodes/hot_threads output changes on each hit.
Cluster details:
3 master-data nodes, 2 data nodes.
All master-data and data nodes have 64 GB RAM and 1.29 TB of disk space, except one master-data node which has 32 GB RAM and 660 GB of disk space.
Heap is set to no more than half of the total available memory.
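To double-check the heap sizing on each node (assuming localhost:9200 is reachable):
```
# heap.max should be at most roughly half of ram.max on every node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max'
```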
jstack 1st hit
at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
jstack 2nd hit
at org.elasticsearch.index.mapper.object.ObjectMapper.toXContent(ObjectMapper.java:609)
at org.elasticsearch.index.mapper.Mapping.toXContent(Mapping.java:141)
at org.elasticsearch.index.mapper.DocumentMapper.toXContent(DocumentMapper.java:394)
at org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:88)
at org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
at org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
- locked <0x00007f7e313703d0> (a org.elasticsearch.index.mapper.MapperService)
Judging from the stack traces that you provided, it looks like the problem is in the MetaDataIndexUpgradeService, which is trying to upgrade your index mapping from 1.4 to 2.4 and either goes into an infinite loop or simply takes a very, very long time for some reason. I briefly reviewed the code and I don't see any place that could cause an infinite loop, yet something in the mapping of your index is causing this issue.
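One way to get a sense of how large and complex that mapping is (assuming access to the source 1.4.4 cluster and using INDEX_NAME as a placeholder) is to dump it and check its size:
```
# Dump the mapping of the problematic index and see how big it is
curl -s 'localhost:9200/INDEX_NAME/_mapping?pretty' > mapping.json
wc -c mapping.json
```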
I couldn't find any related issues, but it might be worth trying to restore in 2.4.4 to see if this issue was fixed there. If that also fails I can try reproducing the issue locally, but for that I would need the following 3 files from your repository:
Here NAME_OF_THE_SNAPSHOT is the name of the snapshot that you are trying to restore and INDEX_NAME is the name of the index in this snapshot. These files contain your cluster persistent settings and templates as well as the index mappings and settings. If you can share this information, please send it to me via a private message or email (my email is my first name at elastic.co).
Many thanks, Igor. We have 2.4.0 in production, but I will try to validate on 2.4.4 as well and will update here. I have easily migrated all the other indices using Logstash and snapshot/restore.