Elasticsearch Java client performance degrades drastically for a large number of hits

Hi everyone,

I am using Elasticsearch 0.90.5 within an Akka application, with the Elasticsearch Java client making the search calls. I've been using this setup for a while with no problems... until now :smile:

As part of a new feature, a new actor (think of it like a new thread, but way cheaper) is spawned for each request made to the application. It in turn makes a call to Elasticsearch for some data. The hits returned from this query are pretty big: >1900 documents come back. All spawned actors share the same Elasticsearch TransportClient, i.e. the client is a singleton across the application.
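For reference, the shared client is created once and handed to every actor, roughly like the sketch below (0.90.x Java API; the cluster name and host are placeholders):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public final class EsClientHolder {
    // Single TransportClient shared by every actor in the application.
    private static final Client CLIENT = create();

    private static Client create() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // placeholder cluster name
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300)); // placeholder host
    }

    public static Client client() {
        return CLIENT;
    }

    private EsClientHolder() {}
}
```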

For a single request, the Elasticsearch client takes around 500ms to retrieve the results and produce the SearchResponse object. However, when the application is put under even minimal load (3 enquiries per second for 30 seconds), the time taken by the client to retrieve the results increases substantially: each request takes 5000-6000ms. As more load is introduced, the app dies very quickly because the Elasticsearch client becomes a bottleneck.

I've followed the pattern above for other Elasticsearch queries and had no problems. The difference here is the number of hits returned and (I think) the work the client has to do to deserialise the information into a SearchResponse.

To make the query, I am using the SearchRequestBuilder and the TransportClient. When I run the query in isolation via the elasticsearch-head plugin, it consistently takes a couple of hundred milliseconds to respond. I have tried using the scan/scroll approach to get the results in 100-hit chunks but it has had no effect, i.e. times still increase rapidly under load. I have even tried creating a new TransportClient for each request (not advised, I know) and that too exhibited the same problem.
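For context, the scan/scroll variant looked roughly like the sketch below (0.90.x Java API; `client` is the shared TransportClient, and the index name and query are placeholders for the real ones):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Initial scan request: no hits come back yet, only a scroll id.
SearchResponse scrollResp = client.prepareSearch("my-index")      // placeholder index
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setQuery(QueryBuilders.matchAllQuery())                  // placeholder query
        .setSize(100)                                             // 100 hits per shard per scroll page
        .execute().actionGet();

// Pull the results back in 100-hit chunks until the scroll is exhausted.
while (true) {
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    if (scrollResp.getHits().getHits().length == 0) {
        break;
    }
    for (SearchHit hit : scrollResp.getHits()) {
        // translate the hit's source into the domain model here
    }
}
```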

I'm a bit at a loss as to how to resolve this issue and would appreciate any advice/help in addressing it. As the request is made in real time it needs to scale under load. If it takes 600ms to run the query and deserialise the response, that is fine. But if it increases to 2+ seconds under load, then it's a problem.

Thanks
Sully

Disclaimer: I know very little about Akka :smile: A few thoughts:

  • Are you disposing of actors after they have finished processing the request? If I understand correctly, Akka Actors do not dispose of themselves automatically, so they will continue to accumulate as "zombie" actors in the background unless you free them. This, plus the response object, will continue eating memory -- causing GCs, etc
  • Related to the first, how is the app "dying"? CPU maxed out? Memory pressure causing GCs?
  • What is your app doing with the data after getting it from Elasticsearch? Possible that another resource (e.g. Disk IO, another datastore, etc) is becoming saturated and slowing down the pipeline?
  • How do the nodes in your cluster look? Idle? Maxed out on some resource?
  • What's the largest response coming back? Are they all ~1900, or can some spike to very large sizes? A few enormous results sprinkled amongst the rest could be causing problems.
  • Similarly, are you always requesting from:0, or paginating deeply? Deep pagination in Elasticsearch can be expensive.

Obligatory disclaimer that 0.90.5 is really old and you should probably upgrade. Likely not related to this issue at all, but I felt required to say it :smile:

Thanks for the response polyfractal. Answers to your questions below:

  • Are you disposing of actors after they have finished processing the request? If I understand correctly, Akka Actors do not dispose of themselves automatically, so they will continue to accumulate as "zombie" actors in the background unless you free them. This, plus the response object, will continue eating memory -- causing GCs, etc

Yes, after the actor has finished retrieving the Elasticsearch data and translating the results into the application's domain, it sends the message on and kills itself. From performance testing the app, the memory profile under load looks OK.
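Concretely, each per-request actor ends roughly like this sketch (Akka's Java API; the next-stage actor and the translation step are placeholders):

```java
import akka.actor.ActorRef;
import akka.actor.UntypedActor;

// Per-request actor: translate the Elasticsearch results, forward them, then stop itself.
public class SearchResultActor extends UntypedActor {
    private final ActorRef nextActor;   // next stage in the pipeline (placeholder)

    public SearchResultActor(ActorRef nextActor) {
        this.nextActor = nextActor;
    }

    @Override
    public void onReceive(Object message) {
        Object translatedResults = translate(message);  // placeholder domain translation
        nextActor.tell(translatedResults, getSelf());
        getContext().stop(getSelf());                   // dispose of this per-request actor
    }

    private Object translate(Object searchResponse) {
        return searchResponse;  // placeholder for the real domain mapping
    }
}
```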

  • Related to the first, how is the app "dying"? CPU maxed out? Memory pressure causing GCs?

So this is a bit of an odd one. The app itself stops responding to any external requests; under performance testing, no requests are received once it has 'died'. From what I can tell, it looks like threads have been starved, as actors begin to time out and no new requests are processed. Monitoring all activity, the only thing that seems to take longer and longer is the Elasticsearch query highlighted above. When the app starts to die, the times seen for that query are up at the 9-10 second mark. At this point there are a lot of actors hanging around and I believe all the threads have been taken up, though granted, I have not conclusively proven this yet. The time is for the query alone (execute().actionGet()).
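One thing I'm considering, given that the waiting happens inside execute().actionGet(), is switching to the asynchronous execute(ActionListener) form so the calling thread isn't blocked while the client waits. A rough sketch (assuming `client` and a hypothetical `requestingActor` are in scope; index, query and size are placeholders):

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

import akka.actor.ActorRef;

// Fire the search without blocking the calling thread; the hypothetical
// requestingActor receives the response (or the failure) as a message.
client.prepareSearch("my-index")                          // placeholder index
        .setQuery(QueryBuilders.matchAllQuery())          // placeholder query
        .setSize(3000)
        .execute(new ActionListener<SearchResponse>() {
            @Override
            public void onResponse(SearchResponse response) {
                requestingActor.tell(response, ActorRef.noSender());
            }

            @Override
            public void onFailure(Throwable e) {
                requestingActor.tell(e, ActorRef.noSender());
            }
        });
```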

  • What is your app doing with the data after getting it from Elasticsearch? Possible that another resource (e.g. Disk IO, another datastore, etc) is becoming saturated and slowing down the pipeline?

The data is translated into the application's internal domain and passed on to the next actor for further processing. I'm confident that the slowdown in the application is occurring around this Elasticsearch query. I've even taken everything else out so that only the query is executed for each request, and I can still clearly see the slowdown and the application going dead, i.e. not responding to new requests and actors timing out.

  • How do the nodes in your cluster look? Idle? Maxed out on some resource?

When running it locally, the cluster nodes run hot, i.e. high CPU. In the performance environment, with much larger Elasticsearch instances, CPU does spike to 20-30% when the application is under load, but it doesn't go higher than that.

  • What's the largest response coming back? Are they all ~1900, or can some spike to very large sizes? A few enormous results sprinkled amongst the rest could be causing problems.

It's consistently ~1900. The document size for each hit is pretty big. The results differ from request to request, though, so it's not something I can cache within the application.

  • Similarly, are you always requesting from:0, or paginating deeply? Deep pagination in Elasticsearch can be expensive.

I am always requesting from:0 and setting the size to 3000 (a cap which I can't reduce). I am not paginating at the moment; everything is required for processing downstream. I tried the scan/scroll approach but it didn't seem to help.

Point taken about the old 0.90.5 version. No defence available! Upgrading is on our roadmap, but unfortunately we don't have time at the moment given the significant breaking changes in the newer versions.

I'm wondering whether the problem is the number of hits being returned and the fairly large size of each document. Would doing a number of smaller Elasticsearch requests, each in a separate actor in a map-reduce style, help here? Or am I misusing Elasticsearch with the data I am trying to retrieve?

Also, is the structure of the data in my index contributing to this problem? If it is overly complex, could that be a potential cause of the issue?

I ran into the same issue recently and am trying to solve it. The behavior you described is exactly the same for me: I use one shared instance of TransportClient in my web service app running on Tomcat, connecting to Elasticsearch. I instrumented the app using JProfiler and noticed that the majority of the time is spent just waiting for the execute().actionGet() method to finish. I test the app using JMeter, sending 100 requests in one second to simulate the expected real load.

What I observed is that when I modify the search query so that it returns nothing, the response time is really low. When I send, let's say, 50 requests at once it's fine - the response time of my service is 20ms - but when I send 100 requests, the response time per request is 200ms. The interesting thing is that when I send the same query 100 times in one second directly to Elasticsearch using JMeter, each request takes only 30ms to finish, so Elasticsearch itself is fast and the TransportClient must be the bottleneck.

Any progress on this?

I am facing the same issue. When I use curl to get the response, it takes around 40-50ms, but when the same query is executed via execute().actionGet() on the TransportClient, it takes around 1000ms. I'm trying to figure out the root cause and solution for this problem. Could anybody please help me out with this?