Failed to create shard, failure IOException[failed to obtain in-memory shard lock]

My cluster consists of 2 nodes, both Docker-based, running on different VMs in the same network.

My cluster health becomes yellow after a few hours: shards start going unassigned one by one, until after a day all the replica shards are unassigned. When I check the shard allocation it looks like this:

So I call the following command: POST /_cluster/reroute?retry_failed=true
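
For reference, the equivalent curl call looks roughly like this (the localhost:9200 address is just an assumption for wherever the cluster is reachable):

    # retry allocating shards whose allocation previously failed the maximum number of times
    curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"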

Immediately, the shards start to initialize:

After about 3-4 minutes, all the shards look assigned and the cluster health is green:

So, I started using the allocation/explain API: GET /_cluster/allocation/explain?pretty
And I got:

    {
      "index" : "projects",
      "shard" : 4,
      "primary" : false,
      "current_state" : "unassigned",
      "unassigned_info" : {
        "reason" : "MANUAL_ALLOCATION",
        "at" : "2020-07-21T08:22:48.307Z",
        "details" : "failed shard on node [Vnl1IdQOTdGDZcr0qG1Wxw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[projects][4]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
        "last_allocation_status" : "no_attempt"
      },
      "can_allocate" : "awaiting_info",
      "allocate_explanation" : "cannot allocate because information about existing shard data is still being retrieved from some of the nodes",
      "node_allocation_decisions" : [
        {
          "node_id" : "Vnl1IdQOTdGDZcr0qG1Wxw",
          "node_name" : "eu01",
          "transport_address" : "172.18.4.6:9300",
          "node_decision" : "throttled",
          "deciders" : [
            {
              "decider" : "throttling",
              "decision" : "THROTTLE",
              "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
            }
          ]
        },
        {
          "node_id" : "gI3ylY0JTNWuCSOSJ1vN2g",
          "node_name" : "us01",
          "transport_address" : "172.18.1.11:9300",
          "node_decision" : "no",
          "deciders" : [
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "a copy of this shard is already allocated to this node [[projects][4], node[gI3ylY0JTNWuCSOSJ1vN2g], [P], s[STARTED], a[id=X-D0rlNmRmuTSkWlR3AQ7w]]"
            },
            {
              "decider" : "throttling",
              "decision" : "THROTTLE",
              "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [gI3ylY0JTNWuCSOSJ1vN2g] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
            }
          ]
        }
      ]
    }
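
Note that without a request body the explain API just picks an unassigned shard to explain; a specific copy can be targeted explicitly, something like this (a sketch, assuming the same [projects][4] replica and a node reachable on localhost:9200):

    # ask specifically about the replica of shard 4 of the "projects" index
    curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
      -H 'Content-Type: application/json' \
      -d '{
        "index": "projects",
        "shard": 4,
        "primary": false
      }'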

I checked my disk space; it's 90% free, so that is not the issue here.
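
(In case it helps, per-node disk usage can also be double-checked from the cluster itself, e.g. with the _cat allocation API, again assuming localhost:9200:)

    # shows shard count per node plus disk.used / disk.avail / disk.percent
    curl -X GET "localhost:9200/_cat/allocation?v"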

Can someone help me understand what the issue is here and why the shards are getting unassigned every day?

Thanks

What type of storage are you using?

Premium SSDs

Where is this hosted? Could there be connectivity issues?

Azure memory optimized VMs. The network should be stable I think.

Your nodes are called us01 and eu01. Are they respectively in the US and the EU?

Yes, exactly. The ping between them is around 80ms

Ok, that seems like the expected behaviour then; transatlantic networking isn't nearly fast or reliable enough for this.

Hmm. Could it be because of the following settings in the jvm.options config?

## DNS cache policy

# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60

# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

And the full file is here:

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms8g
-Xmx8g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

No, I don't see how adjusting the DNS caching config (or indeed any other settings) can change the fact that transatlantic networking isn't nearly fast or reliable enough for this. Clusters should be contained in a single datacenter, maybe with remote clusters elsewhere in the world.
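
For anyone considering the remote-cluster route: one option is to run an independent cluster per region and connect them with cross-cluster search instead of stretching a single cluster across the Atlantic. A minimal sketch, assuming a reasonably recent (6.5+) version, where the alias "us" is arbitrary and the seed address just reuses us01's transport address from the explain output above:

    # on the EU cluster, register the US cluster as a remote cluster
    curl -X PUT "localhost:9200/_cluster/settings" \
      -H 'Content-Type: application/json' \
      -d '{
        "persistent": {
          "cluster": {
            "remote": {
              "us": {
                "seeds": ["172.18.1.11:9300"]
              }
            }
          }
        }
      }'

    # remote indices can then be queried with the <alias>:<index> syntax
    curl -X GET "localhost:9200/us:projects/_search?pretty"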

Ok, I'll try that and see if it helps. But meanwhile, I wonder why it happens only to indices with more than one shard, since the devicelocations index never gets an unassigned status.
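
(For comparison, the primary and replica shard counts per index can be listed like this, a sketch assuming the same node address:)

    # pri = number of primary shards, rep = number of replicas per primary
    curl -X GET "localhost:9200/_cat/indices/projects,devicelocations?v&h=index,pri,rep,health"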
