Well, it turns out this issue was unrelated to Elasticsearch after all. The problem was actually between the API Load Balancer and the API Server nodes. We are using Elastic Beanstalk, a managed application container, to host these API nodes, and the Apache configuration in the Amazon gold image was wrong: the keep-alive and timeout settings were misconfigured, causing connections on the load balancer to time out every minute, which produced the massive spike in response time.
In the performance environment, there is enough time between tests for all the connections from the LB to time out. So, when the load starts up, all the connections to all the nodes are established at once, which is why they were all on the same schedule, and it also explains why that schedule would change over time (it depended on what time the test started).
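For reference, the corrected settings amount to something like the following in the Apache config (the exact values here are illustrative; the key point is that Apache's keep-alive timeout needs to be longer than the ELB's idle timeout, which defaults to 60 seconds, so Apache never drops a connection the load balancer still considers open):

    # Illustrative httpd keep-alive settings for sitting behind an ELB.
    # The ELB idle timeout defaults to 60 s, so Apache's timeouts are set higher.
    KeepAlive On
    MaxKeepAliveRequests 0      # 0 = unlimited requests per persistent connection
    KeepAliveTimeout 120        # longer than the ELB idle timeout
    Timeout 120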
Since correcting the Apache configuration, I have brought the occurrence rate of requests taking longer than 500 ms down from 3% to 0.004%. And generally, those requests now take 1 second instead of 2-4 seconds. So, I would say that it is now resolved, aside from a little more optimization.
Thank you, everyone, for your responses.
On Wednesday, April 15, 2015 at 4:05:27 PM UTC-4, Daryl Robbins wrote:
Thank you, Glen. I appreciate your insight!
Here is our environment:
https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png
All nodes are running in a VPC within the same AWS region, so inter-node latency should be minimal.
I was thinking the same thing about the ES LB. I was wondering if we were hitting a keep-alive timeout or if the extra level of indirection was otherwise creating a problem. So, I tried removing the ES LB between the API Server nodes (ES clients) and the eligible masters earlier today. Each API node is now configured with the private IPs of the three eligible masters. There was no change in the observed behaviour afterwards.
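For reference, the client wiring now amounts to something like this (cluster name and IPs are placeholders, and this assumes the 1.x Java API):

    // Sketch: point the transport client directly at the three eligible masters,
    // bypassing the internal ES load balancer. Cluster name and IPs are placeholders.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class EsClientFactory {
        public static Client create() {
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "my-cluster")
                    .build();
            return new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("10.0.1.10", 9300))
                    .addTransportAddress(new InetSocketTransportAddress("10.0.1.11", 9300))
                    .addTransportAddress(new InetSocketTransportAddress("10.0.1.12", 9300));
        }
    }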
The Load Balancer in front of the API Servers is pre-warmed to 10,000
requests per second. And we're only throwing a couple hundred at it for the
moment.
Thanks for the suggestion about polling various stats on the server. I'll
see what I can rig up.
On Wednesday, April 15, 2015 at 3:38:04 PM UTC-4, Glen Smith wrote:
Cool.
If I read right, your response time statistics graph includes:
1 - network latency between the client nodes and the load balancer
2 - network latency between the load balancer and the cluster's eligible masters
3 - performance of the load balancer
My interest in checking out 1 & 2 would depend on the network topology.
I would for sure want to do something to rule out 3. Any possibility of
letting at least one of the client nodes
bypass the LB for a minute or two?
Then, I might be tempted to set up a script to hit _cat/thread_pool for
60 seconds at a time, with various thread pools/fields, looking for
spikes.
Maybe the same thing with _nodes/stats.
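Something quick and dirty along these lines would do (an untested sketch; the host and the columns requested are just examples, and _nodes/stats could be polled the same way):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch: poll _cat/thread_pool once per second for 60 seconds and print the
    // timestamped output, so spikes in queue/rejected counts stand out.
    public class ThreadPoolPoller {
        public static void main(String[] args) throws Exception {
            String url = "http://localhost:9200/_cat/thread_pool"
                    + "?v&h=host,search.active,search.queue,search.rejected,bulk.queue,bulk.rejected";
            for (int i = 0; i < 60; i++) {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                System.out.println("---- " + new java.util.Date());
                try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        System.out.println(line);
                    }
                }
                conn.disconnect();
                Thread.sleep(1000L);
            }
        }
    }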
On Wednesday, April 15, 2015 at 1:48:17 PM UTC-4, Daryl Robbins wrote:
Thanks, Glen. Yes, I have run top: the Java Tomcat process is the only
thing running at the time. I also checked the thread activity in JProfiler
and nothing out of the ordinary popped up.
On Wednesday, April 15, 2015 at 1:36:55 PM UTC-4, Glen Smith wrote:
Have you run 'top' on the nodes?
On Wednesday, April 15, 2015 at 8:56:20 AM UTC-4, Daryl Robbins wrote:
Thanks for your response. GC was my first thought too. I have looked
through the logs and run the app through a profiler, and I am not seeing any
spike in GC activity or any other background thread activity when performance
degrades. Also, the fact that the slowdown occurs at exactly the same second
every minute points me towards a more deliberate timeout or heartbeat.
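For what it's worth, the GC logging mentioned here comes from JVM flags along these lines on the Tomcat process (the log path is illustrative):

    # Appended to CATALINA_OPTS, e.g. in setenv.sh; the log path is illustrative
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/tomcat/gc.log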
I am running these tests in a controlled performance environment with
constant light to moderate load. There is no change in the behaviour under
very light load. I have turned on slow logging for queries/fetches but am
not seeing any slow queries corresponding to the problem. The only time I
see a slow query is right after a cold start of the search node, so the
slow log is at least working.
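For reference, slow logging of this kind can be turned on dynamically along these lines (index name and thresholds are illustrative; this again assumes the 1.x Java API):

    // Sketch: turn on the search slow logs for an index via the Java client.
    // Index name and thresholds are placeholders.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public class SlowLogSettings {
        public static void enable(Client client) {
            client.admin().indices().prepareUpdateSettings("my_index")
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.search.slowlog.threshold.query.warn", "500ms")
                            .put("index.search.slowlog.threshold.fetch.warn", "500ms")
                            .build())
                    .execute().actionGet();
        }
    }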
On Wednesday, April 15, 2015 at 1:00:00 AM UTC-4, Mark Walkom wrote:
Have you checked the logs for GC events or similar? What about the
web logs for events coming in?
On 15 April 2015 at 09:03, Daryl Robbins darylr...@gmail.com wrote:
I am seeing a consistent bottleneck in requests (taking 2+ seconds) at
the same second every minute across all four of my client nodes, which
connect using the transport client from Java. These nodes are completely
independent aside from their shared reliance on the Elasticsearch cluster,
yet they all pause at the exact same second every minute. The exact second
when this happens varies over time, but the four nodes always pause at the
same time.
I have 4 web nodes that connect to my ES cluster via the transport
client. They connect to a load balancer fronting our 3 dedicated master
nodes. The cluster contains 2 or more data nodes, depending on the
configuration. Regardless of the number, I am seeing the same symptoms.
Any hints on how to proceed to troubleshoot this issue on the
Elasticsearch side would be greatly appreciated. Thanks very much!
https://lh3.googleusercontent.com/-GKiOcsPXBjI/VS2ak04mzBI/AAAAAAAAAhQ/aLDlD82AddY/s1600/Screenshot%2B2015-04-14%2B18.53.24.png