We ran a performance test on my local environment with one node, one index and one shard. The test was run at around 60 TPS. We started low, around 20 to 25 TPS, and the 99th percentile was around 100ms, which is what we expected. The moment we got close to 50 TPS, our requests started queuing, timing out (at the HTTP level) and failing. When we looked at the report, the 50th percentile was around 20 seconds.
The problem is not the client threads being parked; the problem is the transport client being the bottleneck, unable to handle many concurrent requests. The VisualVM profiler shows that almost all of the time the threads are waiting on the BaseSync.get() call.
The calls are going to one index and one shard (no replica). Everything is local as well.
As you can see, the initial calls returned almost instantaneously, but over time requests started getting queued up, and eventually, when you monitor through a profiler, you can see all threads waiting (blocked?) on Sync.get(). I hope this sheds some light. And if anyone can point me to any performance tests that were done using the transport client, that would be great as well.
Has anyone else seen this issue or done any further performance tests? I would expect the transport client to be asynchronous in all possible ways, but it seems like it is blocking on calls.
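For context, this is the difference I mean between the blocking and callback styles on the transport client (a minimal sketch; the index name and query are placeholders, and the listener signature matches the 1.x/2.x client):

    // 'client' is an org.elasticsearch.client.Client (e.g. a TransportClient).

    // Blocking style: the calling thread parks inside the future until the
    // response (or a timeout) arrives.
    SearchResponse blockingResponse = client.prepareSearch("my-index")
            .setQuery(QueryBuilders.matchAllQuery())
            .execute()
            .actionGet();

    // Callback style: execute(listener) returns immediately and the response
    // is delivered on a transport worker thread.
    client.prepareSearch("my-index")
            .setQuery(QueryBuilders.matchAllQuery())
            .execute(new ActionListener<SearchResponse>() {
                @Override
                public void onResponse(SearchResponse response) {
                    // handle the response off the calling thread
                }

                @Override
                public void onFailure(Throwable e) {
                    // handle the failure
                }
            });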
Actually I am using Observables. Here's the code sample I use (it's very close to the actual code, but not the actual code):
private void multiSearch(MultiSearchRequestBuilder builder) {
    logger.debug("Executing multisearch query");
    long start = System.currentTimeMillis();

    // builder.execute() returns a ListenableActionFuture; Observable.from(future)
    // waits on future.get() when subscribed, so the subscribing (io) thread blocks
    // until the response arrives.
    Observable.defer(() -> Observable.from(builder.execute()))
            .map(multiSearchResponse -> {
                long end = System.currentTimeMillis();
                System.out.println("Total time took for query: "
                        + Thread.currentThread().getName() + " " + (end - start));

                // Collect the non-empty results from each item of the multi-search response.
                List<SearchResult> searchResults = new ArrayList<>();
                for (MultiSearchResponse.Item item : multiSearchResponse.getResponses()) {
                    SearchResult result = getSearchResultsFromSearchResponse(item.getResponse());
                    if (result.getHits().size() > 0) {
                        searchResults.add(result);
                    }
                }
                ...
                return searchResults; // the mapped list is what onNext receives below
            })
            .subscribeOn(Schedulers.io())
            .subscribe(new Subscriber<List<SearchResult>>() {
                @Override
                public void onCompleted() {
                    // when the stream is completed
                }

                @Override
                public void onError(Throwable throwable) {
                    deferredResult.setErrorResult(throwable);
                }

                @Override
                public void onNext(List<SearchResult> searchResults) {
                    try {
                        deferredResult.setResult(ResponseEntity.ok(objectMapper.writeValueAsString(searchResults)));
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
}
I ran the test again, this time for an hour. You can see that the 50th percentile is great and then suddenly the 75th percentile shoots up. Somewhere, when the load starts building, the transport client blocks, because everything else in the code is pretty much reactive, based on the listenable future (which is converted to an Observable).
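For comparison, this is one way the same call could be bridged into an Observable through the listener callback instead of wrapping the future, so no thread has to park on future.get() while waiting (a sketch only, not the code above; again assuming the 1.x/2.x listener signature):

    // Adapt the transport client's callback API into an RxJava 1 Observable.
    Observable<MultiSearchResponse> responseObservable = Observable.create(subscriber ->
            builder.execute(new ActionListener<MultiSearchResponse>() {
                @Override
                public void onResponse(MultiSearchResponse multiSearchResponse) {
                    if (!subscriber.isUnsubscribed()) {
                        subscriber.onNext(multiSearchResponse);
                        subscriber.onCompleted();
                    }
                }

                @Override
                public void onFailure(Throwable e) {
                    if (!subscriber.isUnsubscribed()) {
                        subscriber.onError(e);
                    }
                }
            }));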
So I tested and tested over the weekend, and it turns out the problem is not the client but the highlighting! Let me start a different discussion on it and get more input.