We have a 3-node ES cluster running in Docker Swarm (on three EC2 instances in AWS).
Docker image: docker.elastic.co/elasticsearch/elasticsearch:5.4.0
We mount a host directory (located on AWS Elastic File System) as the data volume for ES data (/usr/share/elasticsearch/data) inside the containers, using the mount option: --mount type=bind,source=/mnt/aws-data/prd-es-data/,destination=/usr/share/elasticsearch/data
We also set an environment variable so that ES can keep the indices of all three nodes under one root directory: --env node.max_local_storage_nodes=3
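For reference, the service creation looks roughly like this (a simplified sketch: the service name and network are placeholders, while the mount and env options are the real ones quoted above):
# Simplified example of how each ES service is created in the swarm
docker service create \
  --name prd-es-node \
  --network es-net \
  --mount type=bind,source=/mnt/aws-data/prd-es-data/,destination=/usr/share/elasticsearch/data \
  --env node.max_local_storage_nodes=3 \
  docker.elastic.co/elasticsearch/elasticsearch:5.4.0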
Here is the size of the EFS volume we use for ES data (/mnt/aws-data):
Filesystem Size Used Avail Use% Mounted on
udev 7.9G 0 7.9G 0% /dev
tmpfs 1.6G 90M 1.5G 6% /run
/dev/xvda1 59G 12G 45G 22% /
tmpfs 7.9G 3.0M 7.9G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
*****.amazonaws.com:/ 8.0E 2.1T 8.0E 1% /mnt/aws-data
But when we try to migrate indices from the old prod ES to the new ES cluster using the elasticdump tool, the migration gets stuck and the following error appears in the ES log:
[2017-06-18T07:57:26,585][INFO ][o.e.c.m.MetaDataCreateIndexService] [nEJmGys] [prd_d7726fdb32cae67c7268a493adbf2596_1494502509] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2017-06-18T07:57:27,826][WARN ][o.e.d.r.RestController ] Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header.
[2017-06-18T07:57:27,834][INFO ][o.e.c.m.MetaDataMappingService] [nEJmGys] [prd_d7726fdb32cae67c7268a493adbf2596_1494502509/V1ZN2xrHSQuCsjHeTdsvxg] create_mapping [Patent]
[2017-06-18T07:57:28,182][WARN ][o.e.c.a.s.ShardStateAction] [nEJmGys] [prd_d7726fdb32cae67c7268a493adbf2596_1494502509][0] received shard failed for shard id [[prd_d7726fdb32cae67c7268a493adbf2596_1494502509][0]], allocation id [sVtSIk4KQiKMoKhhFCe1qw], primary term [0], message [failed recovery], failure [RecoveryFailedException[[prd_d7726fdb32cae67c7268a493adbf2596_1494502509][0]: Recovery failed from {nEJmGys}{nEJmGysfR76BJrsu_rnxNQ}{qJcy9LXrRQiMq9RF-PSUjA}{10.0.2.16}{10.0.2.16:9300}{ml.enabled=true} into {oGv4Jvl}{oGv4JvlSQKyfHBIAYWXhjg}{062lTDOYRg6Qrd6abrDQVQ}{10.0.2.9}{10.0.2.9:9300}{ml.enabled=true}]; nested: RemoteTransportException[[nEJmGys][10.0.2.16:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [130b]]; nested: RemoteTransportException[[oGv4Jvl][10.0.2.9:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [prd_d7726fdb32cae67c7268a493adbf2596_1494502509][0]: Recovery failed from {nEJmGys}{nEJmGysfR76BJrsu_rnxNQ}{qJcy9LXrRQiMq9RF-PSUjA}{10.0.2.16}{10.0.2.16:9300}{ml.enabled=true} into {oGv4Jvl}{oGv4JvlSQKyfHBIAYWXhjg}{062lTDOYRg6Qrd6abrDQVQ}{10.0.2.9}{10.0.2.9:9300}{ml.enabled=true}
...
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.RemoteTransportException: [nEJmGys][10.0.2.16:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
...
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1519) ~[elasticsearch-5.4.0.jar:5.4.0]
... 5 more
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [1] files with total size of [130b]
...
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1519) ~[elasticsearch-5.4.0.jar:5.4.0]
... 5 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [oGv4Jvl][10.0.2.9:9300][internal:index/shard/recovery/clean_files]
Caused by: java.io.IOException: Disk quota exceeded
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) ~[?:?]
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) ~[?:?]
at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155) ~[?:1.8.0_131]
and the cluster is in a RED state:
root@68d4a7745784:/var/tmp# curl http://10.0.2.15:9200/_cat/health
1497775511 08:45:11 prd-es-cluster red 3 3 698 350 0 0 10 0 - 98.6%
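To see which indices and shards are affected, something like this can be used (illustrative commands against the same node; red indices and unassigned shards show up here):
curl http://10.0.2.15:9200/_cat/indices?v | grep red
curl http://10.0.2.15:9200/_cat/shards?v | grep UNASSIGNED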
So, as I understand it, the index can't be created because of a "Disk quota exceeded" error.
We don't have any FS quotas enabled for users or the system.
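For completeness, the standard quota tools can be used to double-check this on the hosts (note they only inspect local filesystems, not the NFS-mounted EFS share):
# Report quota usage/limits for all users on filesystems with quotas enabled
repquota -a
# Show quotas for the current user
quota -v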
The total size of the indices we managed to migrate from the old prod ES is about 5.7GB:
root@ip-172-20-0-57:/mnt/aws-data/prd-es-data/nodes# du -hs ../nodes/
5.7G ../nodes/
root@ip-172-20-0-57:/mnt/aws-data/prd-es-data/nodes# ls | xargs du -hs
1.8G 0
2.4G 1
1.6G 2
which is far less than the size of the target EFS volume (8.0E, i.e. exabytes, as reported by df).
So it's completely unclear which quota limit we have hit and how to overcome it.
I rebuilt the ES cluster using the docker image docker.elastic.co/elasticsearch/elasticsearch:5.4.1 and started the index migration from scratch. The process got stuck again after migrating ~70 indices, with the same error in the log:
2017-06-18T14:24:50.963050185Z [2017-06-18T14:24:50,962][INFO ][o.e.c.m.MetaDataCreateIndexService] [3UHXkO9] [prd_3b4a588857adcc7909e1288e1496717f_1493194435] creating index, cause [api], templates [], shards [5]/[1], mappings []
2017-06-18T14:24:51.537451601Z [2017-06-18T14:24:51,530][WARN ][o.e.i.c.IndicesClusterStateService] [3UHXkO9] [[prd_3b4a588857adcc7909e1288e1496717f_1493194435][1]] marking and sending shard failed due to [failed recovery]
2017-06-18T14:24:51.537530089Z org.elasticsearch.indices.recovery.RecoveryFailedException: [prd_3b4a588857adcc7909e1288e1496717f_1493194435][1]: Recovery failed on {3UHXkO9}{3UHXkO9yT5mICH7iqnOVzA}{4E5jH1ZYTxWgOo8GWDbPzg}{10.0.2.16}{10.0.2.16:9300}{ml.enabled=true}
2017-06-18T14:24:51.537631230Z at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1490) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.537651474Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.537799981Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
2017-06-18T14:24:51.537838651Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
2017-06-18T14:24:51.537905963Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
2017-06-18T14:24:51.537927805Z Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
2017-06-18T14:24:51.537978644Z at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538091997Z at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538215958Z at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538248074Z at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538299570Z at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538316587Z at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538372137Z ... 4 more
2017-06-18T14:24:51.538472140Z Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
2017-06-18T14:24:51.538501652Z at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:154) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538555039Z at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538571866Z at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1604) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538621648Z at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1586) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538638115Z at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1029) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538930206Z at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:989) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.538970252Z at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539031392Z at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539049313Z at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539090468Z at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539366840Z at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539394867Z at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
2017-06-18T14:24:51.539689226Z ... 4 more
2017-06-18T14:24:51.539767060Z Caused by: java.io.IOException: Disk quota exceeded
2017-06-18T14:24:51.539835115Z at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]
2017-06-18T14:24:51.539867808Z at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) ~[?:?]
2017-06-18T14:24:51.540288562Z at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) ~[?:?]
But this time the amount of migrated data is even smaller (last time it was about 5.7GB):
ubuntu@ip-172-20-0-57:/mnt/aws-data/prd-es-data$ du -hs ./nodes/
3.1G ./nodes/
So I'm totally lost and can't understand which "Disk quota" has been exceeded.
Am I missing something?
I found the following thread: Elasticsearch with Amazon Elastic File System
So maybe the current issue is also caused by some bug in how ES works with a large FS (particularly with AWS EFS mounted via NFS)?
In case it helps somebody: I think I found the reason for the "Disk quota exceeded" error. Here is an excerpt from the AWS EFS manual:
Command Fails with “Disk quota exceeded” Error
Amazon EFS doesn't currently support user disk quotas. This error can occur if any of the following limits have been exceeded:
Up to 128 active user accounts can have files open at once for an instance.
Up to 32,768 files can be open at once for an instance.
Each unique mount on the instance can acquire up to a total of 8,192 locks across 256 unique file-process pairs. For example, a single process can acquire one or more locks on 256 separate files, or eight processes can each acquire one or more locks on 32 files.
Apparently we hit the limit of 256 unique file-process pairs per mount: a single ES (java) process holds locks on many files (e.g. the per-shard Lucene write.lock files), and once it tries to lock one more file past that limit the lock attempt fails with "Disk quota exceeded". I checked with lslocks and indeed ES holds a lot of file locks.
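For example, a quick way to count locks per process (on our hosts the ES java process holds by far the most):
# Count file locks held per process, most locks first
lslocks --noheadings -o PID,COMMAND | sort | uniq -c | sort -rn | head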