How to stop ES 1.6.0 intermittently hanging?

We are just testing Elasticsearch and have noticed that it will intermittently hang. We noticed this when we moved from running ES on a single node to a cluster of 3 nodes. The nodes run OpenJDK 1.7.0_79 and are started with the following command line and JVM flags:

java -Xms2g -Xmx2g -Djava.awt.headless=true \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC \
  -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -Djna.nosys=true -Delasticsearch \
  -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/share/elasticsearch \
  -cp :/usr/share/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/* \
  -Des.default.config=/etc/elasticsearch/elasticsearch.yml \
  -Des.default.path.home=/usr/share/elasticsearch \
  -Des.default.path.logs=/var/log/elasticsearch \
  -Des.default.path.data=/var/lib/elasticsearch \
  -Des.default.path.work=/tmp/elasticsearch \
  -Des.default.path.conf=/etc/elasticsearch \
  org.elasticsearch.bootstrap.Elasticsearch

Each node typically uses about 50% of its heap. When the tests (creating/deleting/mapping/snapshot) are run they occasionally hang and produce this type of ProcessClusterEventTimeoutException:

[2015-06-28 19:33:56,510][DEBUG][action.admin.indices.mapping.put] [dev-elastic1] failed to put mappings on indices [[test__elastic_pa]], type [marker]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [marker]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:278)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

In the logs on the master node, this debug message appears occasionally for both of the other nodes:

[2015-06-28 19:30:21,350][DEBUG][action.admin.cluster.node.stats] [dev-elastic1] failed to execute on node [UDjdroReRMS9rwU7EVh3_Q]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dev-elastic2][inet[/192.168.175.97:9300]][cluster:monitor/nodes/stats[n]] request_id [12752557] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

All nodes are both master-eligible and data nodes. At the moment the master happens to be the same node the client is configured to use, but we have noticed the same issue when another node was elected master.
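
For reference, one way to check whether cluster-state updates (put-mapping, create-index, etc.) are queuing up on the master while the tests run is the pending tasks API; a minimal example, assuming the default HTTP port 9200 on the current master:

curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'   # lists cluster-state update tasks still waiting to be processed by the master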

Thanks

What does your config look like, and how much data do you have in the cluster?

The only config changes in elasticsearch.yml (apart from cluster name, node name, and path.data) are:

bootstrap.mlockall: true
discovery.zen.minimum_master_nodes: 2

"primaries" : {
  "store" : {
    "size_in_bytes" : 36196996217,
    "throttle_time_in_millis" : 257
  }
},
"total" : {
  "store" : {
    "size_in_bytes" : 72393991004,
    "throttle_time_in_millis" : 576
  }
}


  "docs" : {
    "count" : 290971216,
    "deleted" : 47677
  },

Thanks

Are you using multicast then?
What happens if you try unicast?
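
Something along these lines in elasticsearch.yml should do it (the host list below is only illustrative; use your own node addresses):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["dev-elastic1", "dev-elastic2", "dev-elastic3"]   # placeholder host names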

Thanks. Yes, this is using multicast. Switching to unicast seems to make the exception slightly harder to reproduce, but unfortunately I can still trigger it:

[2015-06-29 10:08:49,229][INFO ][cluster.metadata ] [dev-elastic2] [test__snp_auto_tests_xxx] deleting index
[2015-06-29 10:09:01,051][DEBUG][action.admin.cluster.health] [dev-elastic2] observer: timeout notification from cluster service. timeout setting [30s], time since start [30.7s]
[2015-06-29 10:09:08,298][INFO ][cluster.metadata ] [dev-elastic2] [test__json_test] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2015-06-29 10:09:19,086][INFO ][cluster.metadata ] [dev-elastic2] [test__study_tests_xxx] deleting index
[2015-06-29 10:09:36,983][DEBUG][action.admin.cluster.node.stats] [dev-elastic2] failed to execute on node [zwmOTZ5gSY-jrxj9gAEmfw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dev-elastic1][inet[/192.168.175.96:9300]][cluster:monitor/nodes/stats[n]] request_id [226031] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-06-29 10:09:42,674][INFO ][cluster.metadata ] [dev-elastic2] [test__imb_criteria_auto_tests_xxx] deleting index
[2015-06-29 10:09:43,209][WARN ][transport ] [dev-elastic2] Received response for a request that has timed out, sent [21227ms] ago, timed out [6227ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dev-elastic1][zwmOTZ5gSY-jrxj9gAEmfw][dev-elastic1.dil.private.cimr.cam.ac.uk][inet[/192.168.175.96:9300]]], id [226031]
[2015-06-29 10:09:48,989][INFO ][cluster.metadata ] [dev-elastic2] [test__alias_auto_tests_xxx] deleting index
[2015-06-29 10:09:49,095][DEBUG][action.admin.indices.mapping.put] [dev-elastic2] failed to put mappings on indices [[test__json_elastic]], type [t1d]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [t1]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:278)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

The tests create and delete only small indexes, but I am still using the default 5 shards and 1 replica. The tests also create a repository to test snapshot/restore.

How many indices are you creating, and are you monitoring your cluster stats?

They create about 18 indexes (with 2–60 documents each). Running the test suite 2 or 3 times simultaneously causes the issue. I have the Bigdesk and HQ plugins installed, so I have been watching those. Is there something in particular I should be looking for? The cluster status turns red, as expected, while the indexes are loaded:

{ "timestamp" : 1435574811186, "cluster_name" : "dev-elastic", "status" : "red", "indices" : { "count" : 24, "shards" : { "total" : 231, "primaries" : 118, "replication" : 0.9576271186440678, "index" : { "shards" : { "min" : 3, "max" : 10, "avg" : 9.625 }, "primaries" : { "min" : 3, "max" : 5, "avg" : 4.916666666666667 }, "replication" : { "min" : 0.0, "max" : 1.0, "avg" : 0.9416666666666668 } } }, "docs" : { "count" : 290971368, "deleted" : 47677 }, "store" : { "size_in_bytes" : 72394158044, "throttle_time_in_millis" : 0 }, "fielddata" : { "memory_size_in_bytes" : 0, "evictions" : 0 }, "filter_cache" : { "memory_size_in_bytes" : 17900716, "evictions" : 0 }, "id_cache" : { "memory_size_in_bytes" : 0 }, "completion" : { "size_in_bytes" : 0 }, "segments" : { "count" : 1794, "memory_in_bytes" : 604096756, "index_writer_memory_in_bytes" : 1008872, "index_writer_max_memory_in_bytes" : 3497554429, "version_map_memory_in_bytes" : 6248, "fixed_bit_set_memory_in_bytes" : 0 }, "percolate" : { "total" : 0, "time_in_millis" : 0, "current" : 0, "memory_size_in_bytes" : -1, "memory_size" : "-1b", "queries" : 0 } }, "nodes" : { "count" : { "total" : 3, "master_only" : 0, "data_only" : 0, "master_data" : 3, "client" : 0 }, "versions" : [ "1.6.0" ], "os" : { "available_processors" : 6, "mem" : { "total_in_bytes" : 25115123712 }, "cpu" : [ { "vendor" : "Intel", "model" : "Xeon", "mhz" : 2899, "total_cores" : 2, "total_sockets" : 2, "cores_per_socket" : 1, "cache_size_in_bytes" : 4096, "count" : 3 } ] }, "process" : { "cpu" : { "percent" : 116 }, "open_file_descriptors" : { "min" : 1240, "max" : 1338, "avg" : 1284 } }, "jvm" : { "max_uptime_in_millis" : 6609938, "versions" : [ { "version" : "1.7.0_79", "vm_name" : "OpenJDK 64-Bit Server VM", "vm_version" : "24.79-b02", "vm_vendor" : "Oracle Corporation", "count" : 3 } ], "mem" : { "heap_used_in_bytes" : 2755160520, "heap_max_in_bytes" : 6390153216 }, "threads" : 174 }, "fs" : { "total_in_bytes" : 5420423577600, "free_in_bytes" : 5187717169152, "available_in_bytes" : 5187717169152 }, "plugins" : [ { "name" : "bigdesk", "version" : "NA", "description" : "No description found.", "url" : "/_plugin/bigdesk/", "jvm" : false, "site" : true }, { "name" : "HQ", "version" : "NA", "description" : "No description found.", "url" : "/_plugin/HQ/", "jvm" : false, "site" : true } ] } }

I was thinking it might be an overloaded node/cluster, but based on what you have shown I don't think it is.

I'm at a bit of a loss at the moment sorry, maybe someone else can help!

Thanks for your help. I think you are right. Looking at the load, I now see peaks (up to 6–8) on some nodes while the tests run. I had not realised that such small data loads would do that. I wonder if I would be better off using fewer shards for these small indexes.

It seems like it is related to hitting the disk, and perhaps you have over-allocated primaries per index?

How does your hardware per node compare to the documented guidelines?

Yes, thanks. Reducing number_of_shards from 5 to 1 for the test indexes speeds the tests up and appears to solve this.
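
For anyone hitting the same thing, the change amounts to creating the test indexes with explicit settings instead of the defaults; roughly like this (the index name and replica count here are only examples):

# index name below is just an example
curl -s -XPUT 'http://localhost:9200/test__example_index' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}'

An index template matching the test index name pattern would achieve the same without changing each create call.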