How to stop ES 1.6.0 intermittently hanging?

We are just testing Elasticsearch and have noticed that it will intermittently hang. We noticed this when we moved from running ES on a single node to a cluster of 3 nodes. The nodes run OpenJDK 1.7.0_79 and are started with the following command line and JVM flags:

java -Xms2g -Xmx2g -Djava.awt.headless=true \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC \
  -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -Djna.nosys=true -Delasticsearch \
  -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch \
  -cp :/usr/share/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* \
  -Des.default.config=/etc/elasticsearch/elasticsearch.yml \
  -Des.default.path.home=/usr/share/elasticsearch \
  -Des.default.path.logs=/var/log/elasticsearch \
  -Des.default.path.data=/var/lib/elasticsearch \
  -Des.default.path.work=/tmp/elasticsearch \
  -Des.default.path.conf=/etc/elasticsearch \
  org.elasticsearch.bootstrap.Elasticsearch

Each node typically uses about 50% of its heap. When the tests (creating/deleting/mapping/snapshot) are run they occasionally hang and produce this type of ProcessClusterEventTimeoutException:

[2015-06-28 19:33:56,510][DEBUG][action.admin.indices.mapping.put] [dev-elastic1] failed to put mappings on indices [[test__elastic_pa]], type [marker]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [marker]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:278)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

In the logs on the master node, this debug message appears occasionally for both of the other nodes:

[2015-06-28 19:30:21,350][DEBUG][action.admin.cluster.node.stats] [dev-elastic1] failed to execute on node [UDjdroReRMS9rwU7EVh3_Q]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dev-elastic2][inet[/192.168.175.97:9300]][cluster:monitor/nodes/stats[n]] request_id [12752557] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

All nodes are both master-eligible and data nodes. At the moment the master happens to be the same node the client is configured to use, but we have noticed the same issue when another node was elected master.
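
For reference, one way to check whether cluster-state updates (put-mapping, create-index, etc.) are queuing up on the master while the tests run is the pending tasks API; a minimal example, assuming the default HTTP port 9200 on the current master:

curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'   # lists cluster-state update tasks still waiting to be processed by the master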

Thanks

What does your config look like, and how much data do you have in the cluster?

The only config changes in elasticsearch.yml (apart from cluster name, node name, and path.data) are:

bootstrap.mlockall: true
discovery.zen.minimum_master_nodes: 2

"primaries" : {
  "store" : {
    "size_in_bytes" : 36196996217,
    "throttle_time_in_millis" : 257
  }
},
"total" : {
  "store" : {
    "size_in_bytes" : 72393991004,
    "throttle_time_in_millis" : 576
  }
}


  "docs" : {
    "count" : 290971216,
    "deleted" : 47677
  },

Thanks

Are you using multicast then?
What happens if you try unicast?
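
Something along these lines in elasticsearch.yml should do it (the host list below is only illustrative; use your own node addresses):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dev-elastic1", "dev-elastic2", "dev-elastic3"]   # placeholder host names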

Thanks. Yes, this is using multicast. Switching to unicast seems to make the exception slightly harder to reproduce, but unfortunately I can still trigger it:

[2015-06-29 10:08:49,229][INFO ][cluster.metadata ] [dev-elastic2] [test__snp_auto_tests_xxx] deleting index
[2015-06-29 10:09:01,051][DEBUG][action.admin.cluster.health] [dev-elastic2] observer: timeout notification from cluster service. timeout setting [30s], time since start [30.7s]
[2015-06-29 10:09:08,298][INFO ][cluster.metadata ] [dev-elastic2] [test__json_test] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2015-06-29 10:09:19,086][INFO ][cluster.metadata ] [dev-elastic2] [test__study_tests_xxx] deleting index
[2015-06-29 10:09:36,983][DEBUG][action.admin.cluster.node.stats] [dev-elastic2] failed to execute on node [zwmOTZ5gSY-jrxj9gAEmfw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dev-elastic1][inet[/192.168.175.96:9300]][cluster:monitor/nodes/stats[n]] request_id [226031] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-06-29 10:09:42,674][INFO ][cluster.metadata ] [dev-elastic2] [test__imb_criteria_auto_tests_xxx] deleting index
[2015-06-29 10:09:43,209][WARN ][transport ] [dev-elastic2] Received response for a request that has timed out, sent [21227ms] ago, timed out [6227ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dev-elastic1][zwmOTZ5gSY-jrxj9gAEmfw][dev-elastic1.dil.private.cimr.cam.ac.uk][inet[/192.168.175.96:9300]]], id [226031]
[2015-06-29 10:09:48,989][INFO ][cluster.metadata ] [dev-elastic2] [test__alias_auto_tests_xxx] deleting index
[2015-06-29 10:09:49,095][DEBUG][action.admin.indices.mapping.put] [dev-elastic2] failed to put mappings on indices [[test__json_elastic]], type [t1d]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [t1]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:278)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

The tests create and delete only small indexes, but I am still using the default 5 shards and 1 replica. The tests also create a repository to test snapshot/restore.

How many indices are you creating, and are you monitoring your cluster stats?

They create about 18 indexes (with 2–60 documents each). Running the test suite 2 or 3 times simultaneously causes the issue. I have the Bigdesk and HQ plugins installed, so I have been watching those. Is there something in particular I should be looking for? The cluster status turns red, as expected, while the indexes are loaded:

{ "timestamp" : 1435574811186, "cluster_name" : "dev-elastic", "status" : "red", "indices" : { "count" : 24, "shards" : { "total" : 231, "primaries" : 118, "replication" : 0.9576271186440678, "index" : { "shards" : { "min" : 3, "max" : 10, "avg" : 9.625 }, "primaries" : { "min" : 3, "max" : 5, "avg" : 4.916666666666667 }, "replication" : { "min" : 0.0, "max" : 1.0, "avg" : 0.9416666666666668 } } }, "docs" : { "count" : 290971368, "deleted" : 47677 }, "store" : { "size_in_bytes" : 72394158044, "throttle_time_in_millis" : 0 }, "fielddata" : { "memory_size_in_bytes" : 0, "evictions" : 0 }, "filter_cache" : { "memory_size_in_bytes" : 17900716, "evictions" : 0 }, "id_cache" : { "memory_size_in_bytes" : 0 }, "completion" : { "size_in_bytes" : 0 }, "segments" : { "count" : 1794, "memory_in_bytes" : 604096756, "index_writer_memory_in_bytes" : 1008872, "index_writer_max_memory_in_bytes" : 3497554429, "version_map_memory_in_bytes" : 6248, "fixed_bit_set_memory_in_bytes" : 0 }, "percolate" : { "total" : 0, "time_in_millis" : 0, "current" : 0, "memory_size_in_bytes" : -1, "memory_size" : "-1b", "queries" : 0 } }, "nodes" : { "count" : { "total" : 3, "master_only" : 0, "data_only" : 0, "master_data" : 3, "client" : 0 }, "versions" : [ "1.6.0" ], "os" : { "available_processors" : 6, "mem" : { "total_in_bytes" : 25115123712 }, "cpu" : [ { "vendor" : "Intel", "model" : "Xeon", "mhz" : 2899, "total_cores" : 2, "total_sockets" : 2, "cores_per_socket" : 1, "cache_size_in_bytes" : 4096, "count" : 3 } ] }, "process" : { "cpu" : { "percent" : 116 }, "open_file_descriptors" : { "min" : 1240, "max" : 1338, "avg" : 1284 } }, "jvm" : { "max_uptime_in_millis" : 6609938, "versions" : [ { "version" : "1.7.0_79", "vm_name" : "OpenJDK 64-Bit Server VM", "vm_version" : "24.79-b02", "vm_vendor" : "Oracle Corporation", "count" : 3 } ], "mem" : { "heap_used_in_bytes" : 2755160520, "heap_max_in_bytes" : 6390153216 }, "threads" : 174 }, "fs" : { "total_in_bytes" : 5420423577600, "free_in_bytes" : 5187717169152, "available_in_bytes" : 5187717169152 }, "plugins" : [ { "name" : "bigdesk", "version" : "NA", "description" : "No description found.", "url" : "/_plugin/bigdesk/", "jvm" : false, "site" : true }, { "name" : "HQ", "version" : "NA", "description" : "No description found.", "url" : "/_plugin/HQ/", "jvm" : false, "site" : true } ] } }

I was thinking it might be an overloaded node/cluster, but based on what you have shown I don't think it is.

I'm at a bit of a loss at the moment sorry, maybe someone else can help!

Thanks for your help. I think you are right. Looking at the load, I now see peaks (up to 6–8) on some nodes while the tests run. I had not realised that such small data loads would do that. I wonder if I would be better off using fewer shards for these small indexes.

It seems like it is related to hitting the disk, and perhaps you have over-allocated primaries per index?

How does your hardware per node compare to the documented guidelines?

Yes, thanks. Reducing number_of_shards from 5 to 1 for the test indexes speeds the tests up and appears to solve this.
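
For anyone hitting the same thing, the change amounts to creating the test indexes with explicit settings instead of the defaults; roughly like this (the index name and replica count here are only examples):

# index name below is just an example
curl -s -XPUT 'http://localhost:9200/test__example_index' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}'

An index template matching the test index name pattern would achieve the same without changing each create call.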