We are observing that a few nodes in our cluster seem to have a spike in search queue items (correlated with a spike in latency). We are relying on slow logs for part of the debugging, and I wasn't sure if the took_millis logged in slow logs includes the time a request spent in the queue before execution. Is there any way to get a breakdown of how much time a request spent in the queue?
took_millis does not include the time spent in the queue before execution. As per the slowlog docs:
The logging is done on the shard level scope, meaning the execution of a search request within a specific shard. It does not encompass the whole search request, which can be broadcast to several shards in order to execute. Some of the benefits of shard level logging is the association of the actual execution on the specific machine, compared with request level.
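(Side note for anyone landing here: those shard-level slow logs are driven by per-index thresholds. A minimal sketch of setting them, assuming a local cluster; the host, index name, and threshold values below are placeholders:)

```python
import requests

ES = "http://localhost:9200"  # placeholder host

# Per-index search slow log thresholds; the values below are only examples.
settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}
resp = requests.put(f"{ES}/my-index/_settings", json=settings)  # "my-index" is a placeholder
resp.raise_for_status()
print(resp.json())
```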
Is there any way to get a breakdown of how much time a request spent in the queue?
I can't think of a way - hopefully another user who knows will drop by. However, I'm not sure how much that would help you: the queue only fills up because there are too many requests or they are being processed too slowly, so I'd focus effort on resolving those underlying problems. The Profile API may help you here.
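A minimal sketch of running a profiled search (host, index name, and query below are placeholders, not from your setup):

```python
import requests

ES = "http://localhost:9200"  # placeholder host

# Run a search with "profile": true; the response contains a per-shard
# breakdown of where query execution time went (it does not include queue time).
body = {
    "profile": True,
    "query": {"match": {"message": "error"}},  # placeholder query
}
resp = requests.post(f"{ES}/my-index/_search", json=body)  # "my-index" is a placeholder
resp.raise_for_status()

for shard in resp.json()["profile"]["shards"]:
    query_ns = sum(q["time_in_nanos"] for s in shard["searches"] for q in s["query"])
    print(shard["id"], f"query phase: {query_ns / 1e6:.1f} ms")
```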
Thanks for the response. I have been trying to profile these queries, but the weird thing is that they are slow only during certain periods; if I re-run the same query now, it executes fine and completes in 1s. That's why I was asking whether the slow logs include the queueing time: I read the slow log documentation, which mentions "execution on the specific machine", but search queues are also specific to the machine, so I wasn't sure.
I'll check if anyone else knows about profiling the queue :). Do you see anything in your metrics, like CPU, disk, or network I/O spikes, around the time you notice the intermittent slowness?
I definitely noticed a CPU usage spike (above 90%). Disk, network, and I/O seem about normal.
At least on one of the hosts I was debugging, the thread_pool.management.queue metric also had a spike.
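(For anyone following along: since there is no per-request queue-time breakdown, one workaround is to poll the per-node thread pool stats and correlate the queue depth with the slow periods. A rough sketch, with the host and poll interval as placeholders:)

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder host

# Poll the search and management thread pools on every node.
# "queue" is the current depth; "rejected" counts requests dropped because the queue was full.
while True:
    resp = requests.get(
        f"{ES}/_cat/thread_pool/search,management",
        params={"format": "json", "h": "node_name,name,active,queue,rejected,completed"},
    )
    resp.raise_for_status()
    for row in resp.json():
        print(row["node_name"], row["name"],
              f"active={row['active']}", f"queue={row['queue']}", f"rejected={row['rejected']}")
    time.sleep(10)  # placeholder poll interval
```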
In a few of the boxes I was looking at (I haven't checked the others yet), I do notice that during the period when the search queue builds up, there is an ongoing merge happening (the merges.current metric is 1 or more).
I understand merges can be I/O intensive; are they CPU intensive as well? Or is the CPU spike more to do with the searches?
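A rough way to line those two up is to check merge activity and hot threads per node at the moment the queue spikes, to see whether merge threads or search threads are the ones consuming CPU (host below is a placeholder; hot_threads returns plain text):

```python
import requests

ES = "http://localhost:9200"  # placeholder host

# Per-node merge activity: merges.current > 0 means a merge is running right now.
stats = requests.get(f"{ES}/_nodes/stats/indices").json()
for node in stats["nodes"].values():
    merges = node["indices"]["merges"]
    print(node["name"], "current merges:", merges["current"],
          "total merge time (ms):", merges["total_time_in_millis"])

# Hot threads shows which threads are actually busy on each node,
# e.g. search threads vs Lucene merge threads.
print(requests.get(f"{ES}/_nodes/hot_threads", params={"threads": 5}).text)
```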