I'm performing load tests of 200-400 tps (JMeter) against an application written in Vert.x 3 that uses a TCP transport client (singleton) to run searches on Elasticsearch 1.7.0.
After a couple of minutes the app becomes unresponsive; when it does respond, latencies are on the order of 10 seconds per request.
I took a thread dump of the app and see lots of blocked threads, and netstat -p tells me that all TCP connections to the Elasticsearch client nodes (no sniffing) are established, with absolutely none in a wait state.
I've tried setting the listener as threaded, but no luck, and I haven't been able to reproduce this in an integration test either. Any ideas? Is there a bottleneck/limitation in the TCP transport client, and would I be better off switching to REST?
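For reference, the search path looks roughly like this; the cluster name, host and index below are placeholders, not the actual values, but it shows the singleton ES 1.7 TransportClient being used asynchronously from a Vert.x 3 verticle:

import io.vertx.core.AbstractVerticle;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class SearchVerticle extends AbstractVerticle {

    // Shared singleton client, created once for the whole application.
    private static final TransportClient CLIENT = new TransportClient(
            ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "my-cluster")
                    .put("client.transport.sniff", false)   // no sniffing, as described above
                    .build())
            .addTransportAddress(new InetSocketTransportAddress("es-client-1", 9300));

    @Override
    public void start() {
        vertx.eventBus().consumer("search.requests", msg -> {
            // Asynchronous search, so the Vert.x event loop never blocks on actionGet().
            CLIENT.prepareSearch("products")
                    .setQuery(QueryBuilders.matchQuery("name", msg.body().toString()))
                    .execute(new ActionListener<SearchResponse>() {
                        @Override
                        public void onResponse(SearchResponse response) {
                            msg.reply(response.toString());
                        }

                        @Override
                        public void onFailure(Throwable t) {
                            msg.fail(500, t.getMessage());
                        }
                    });
        });
    }
}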
Wild guess: maybe garbage collection is kicking in on the server side. You only mentioned the client-side app, so maybe that is not the culprit. However, it is very hard to tell from the given information. Have you checked the cluster log files, polled the hot threads API while running the tests, continuously read the nodes stats, and reviewed that information after your run?
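Something like the following could capture both endpoints during a run; the host, port and sampling interval are just illustrative:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;

public class EsPoller {
    // Fetch an Elasticsearch HTTP API endpoint and return the body as a string.
    static String get(String path) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://es-client-1:9200" + path).openConnection();
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
            return s.useDelimiter("\\A").hasNext() ? s.next() : "";
        }
    }

    static void append(String file, String content) throws Exception {
        Files.write(Paths.get(file), (content + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            // Sample hot threads and node stats every 10 seconds while the load test runs.
            append("hot_threads.log", get("/_nodes/hot_threads"));
            append("nodes_stats.log", get("/_nodes/stats"));
            Thread.sleep(10_000);
        }
    }
}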
Also, that Elasticsearch version really deserves an upgrade, for so many reasons.
Will definitely be having a look at hot threads on elastic's side and will post my findings.
I agree the Elasticsearch cluster and client need an update.
Overall, in terms of memory and CPU, the nodes appear to be doing fine. We've also run similar loads with raw queries against Elasticsearch with no issues worth mentioning.
I've also mocked the Elasticsearch responses in the API to rule out issues with the API itself. So at this point it is looking like the issue is either the network or the TCP transport.
But will definitely do more investigation and post my findings.
I switched my search method to use the new Elasticsearch HTTP client. I was again able to replicate the same issue under load after about 25 minutes.
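Roughly, the switch looks like this, assuming by "HTTP client" I mean the new low-level REST client; the host, index and query body are placeholders:

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class RestSearch {
    // Shared low-level REST client, one instance for the application.
    private static final RestClient REST = RestClient.builder(
            new HttpHost("es-client-1", 9200, "http")).build();

    public static void main(String[] args) {
        String body = "{\"query\":{\"match_all\":{}}}";
        // Asynchronous request, so the calling (Vert.x event loop) thread is never blocked.
        REST.performRequestAsync("GET", "/products/_search",
                Collections.<String, String>emptyMap(),
                new NStringEntity(body, ContentType.APPLICATION_JSON),
                new ResponseListener() {
                    @Override
                    public void onSuccess(Response response) {
                        System.out.println(response.getStatusLine());
                    }

                    @Override
                    public void onFailure(Exception exception) {
                        exception.printStackTrace();
                    }
                });
    }
}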
I've taken multiple profiles of my APIs (I've got two instances running under a load balancer).
I'm happy to send anyone the Java Flight Recorder profiles I got from the test; unfortunately, I'm not allowed to upload them here.
I can see my data nodes at around 98% CPU usage per core (currently running a cluster with 2 client, 2 data and 2 master nodes). All other nodes seem fine in every other respect.
I'm not sure I'm getting very useful data out of the hot threads API.
If anyone wants the profiles to check, I'll really be happy to send them over email.
Still nothing about the server side here.
What about the log files? Can you see GC?
Why is hot threads not useful?
Checking the nodes stats might also help.
Any configuration changed from the defaults?
I take it back: the information from hot threads was useful. I also monitored Elasticsearch's stats and the VMs the data nodes are running on, and I can definitely see threads taking a really long processing time (above 300ms) and really high CPU usage. I scaled the cluster out and up by adding one node and improving the CPU capacity of the boxes we're running on, and latency is now down to reasonable levels.
I was also able to find the reason why the API was not responding, and this was down to the way I was handling responses across Vert.x's event bus: essentially my sender was expecting a reply that never arrived.
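In sketch form, the fix was to make sure the consumer always answers (reply or fail) and the sender uses a reply timeout; addresses and payloads below are illustrative, not the real ones:

import io.vertx.core.Vertx;
import io.vertx.core.eventbus.DeliveryOptions;

public class ReplyExample {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        vertx.eventBus().consumer("search.requests", msg -> {
            try {
                String result = "...";           // perform the search here
                msg.reply(result);               // success path: always answer the sender
            } catch (Exception e) {
                msg.fail(500, e.getMessage());   // failure path: still answer the sender
            }
        });

        // Sender: a send timeout makes a missing reply surface as a failure
        // instead of leaving the request hanging forever.
        vertx.eventBus().send("search.requests", "query",
                new DeliveryOptions().setSendTimeout(5000),
                reply -> {
                    if (reply.succeeded()) {
                        System.out.println(reply.result().body());
                    } else {
                        System.err.println("no reply: " + reply.cause());
                    }
                });
    }
}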
I didn't go into GC on the cluster side, as the hot threads and stats endpoints were useful enough to figure out that the node specs were not up to scratch.
However, if you'd like, or would find it useful, I can reproduce the previous conditions and post the hot threads and stats output here. It might be useful for other people?
Agreed. In any case, I can go into a bit more detail. This is an excerpt of the hot threads output (I can't fit the whole thing) for both data nodes; below is data node 1. I'd like to add the stats as well, but can't fit them into the post:
Hot threads at 2017-02-01T14:20:06.902Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
45.5% (227.4ms out of 500ms) cpu usage by thread 'elasticsearch[ess-elas-data1-dev1][search][T#6]'
9/10 snapshots sharing following 9 elements
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
unique snapshot
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:278)
org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:117)
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
org.elasticsearch.script.ScriptService.getScriptFromIndex(ScriptService.java:355)
org.elasticsearch.script.ScriptService.compileInternal(ScriptService.java:316)
org.elasticsearch.script.ScriptService.compile(ScriptService.java:287)
org.elasticsearch.script.ScriptService.executable(ScriptService.java:457)
org.elasticsearch.search.SearchService.parseTemplate(SearchService.java:703)
org.elasticsearch.search.SearchService.createContext(SearchService.java:571)
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:544)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:306)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
44.3% (221.3ms out of 500ms) cpu usage by thread 'elasticsearch[ess-elas-data1-dev1][search][T#1]'
6/10 snapshots sharing following 9 elements
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:776)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:767)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
3/10 snapshots sharing following 21 elements
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:308)
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readList(AbstractXContentParser.java:284)
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:310)
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)