We are trying to analyze an issue where we occasionally get slow responses for a query that is usually fast.
Our queries aggregate on a field called entityId, which is a not-analyzed string.
We run an aggregation query which executes a terms aggregation on that field (entityId), with a specified size.
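For reference, a minimal sketch of the kind of request body we run; the index name and the size value (100) are placeholders, but the terms aggregation on the not-analyzed entityId field matches our query:

```python
import json

# Hypothetical sketch of our aggregation request body for ES 2.3.
# "size": 0 because we only need the aggregation buckets, not the hits.
query = {
    "size": 0,
    "aggs": {
        "by_entity": {
            "terms": {
                "field": "entityId",  # not-analyzed string field
                "size": 100           # placeholder bucket count
            }
        }
    }
}

# This body would be POSTed to /<index>/_search.
print(json.dumps(query, indent=2))
```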
We have noticed that while the index usually returns a response in ~10ms for that query, each time we write to the index (indexing a new document, re-indexing an existing one, or deleting a doc), the next 2 queries are much slower, around 400ms. When profiling the queries we saw that those 2 slow queries return responses from two different sets of shards, probably a distribution between primary shards and replicas.
We suspect that the write operation is causing the index to rebuild the data it needs in order to perform the aggregation, but we don't know why that should happen.
Our cluster consists of 3 nodes, with 6 shards (+6 replicas), running on ES 2.3.
Well, this gets into some really core troubleshooting.
The 10ms is probably served from cached data. For example, after you restart everything and issue the query, how long does it take to run it once? That would be the "uncached" performance; then run it again, and that would be the cached performance.
How big is your index? How many shards, and how much CPU and heap space?
While you're testing, is there any indexing going on, index rotation, or other people querying?
Next I would look at your I/O during the time of the query: run iotop, watch all your hosts, and see if there is any massive spike (a spike lasting less than 1 second will be tough to catch; you could try sar, adjusting it to collect very often, and then look at your performance).
Then I would look at your system memory usage and heap
In Linux the OS tries to cache disk reads, which is very beneficial to ELK. If you have 0 free memory, what does your available memory look like (used, cache, and buffer)? If used is at 0 you could probably use some more memory, or if your heap is not fully used, decrease that amount.
Then you can get in to tuning Elastic.
You can change the percentage of heap used for caching (this may be a good start).
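As an illustration, the fielddata cache size can be bounded in elasticsearch.yml; the 30% value below is just an example starting point, not a recommendation for your workload:

```yaml
# elasticsearch.yml (ES 2.x) - example only, tune to your own heap
indices.fielddata.cache.size: 30%
```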
Index size is 24,695,454 docs, taking 4GB on disk.
Total memory on each machine is 12GB, heap is configured to use up to 6GB, and on all nodes the heap is 7 to 10 percent used.
The index is at rest when I test; it is an isolated environment, so I know that I'm the only one querying it, and every write operation I make is followed by the slow responses.
Do these numbers give you any more ideas? I currently don't know which way to look; I don't even understand why my index would need to rebuild some cache construct when I'm using doc values, which are built incrementally at index time.
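For context, this is roughly the mapping in play; the exact field options here are a hedged reconstruction based on the description above (a not-analyzed string with doc values, which is the ES 2.x default for not-analyzed strings):

```python
# Sketch of an ES 2.x mapping for the entityId field described above.
mapping = {
    "properties": {
        "entityId": {
            "type": "string",         # ES 2.x string type (pre-"keyword")
            "index": "not_analyzed",  # indexed as one exact token
            "doc_values": True        # on-disk columnar values, built at index time
        }
    }
}
```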
We have finally managed to handle the issue by changing the terms aggregation into a filters aggregation (@eperry, I found it in the post you referred me to, thanks!).
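For anyone hitting the same problem, a hedged sketch of what that change might look like: a filters aggregation needs the candidate entityId values enumerated up front (the IDs below are placeholders), producing one named bucket per filter instead of dynamically computed terms buckets:

```python
import json

# Placeholder entity IDs; in practice these come from the application.
entity_ids = ["e1", "e2"]

# Equivalent request body using a filters aggregation (ES 2.x):
# one named term filter, and hence one bucket, per known entityId.
filters_agg = {
    "size": 0,
    "aggs": {
        "by_entity": {
            "filters": {
                "filters": {
                    eid: {"term": {"entityId": eid}}
                    for eid in entity_ids
                }
            }
        }
    }
}

print(json.dumps(filters_agg, indent=2))
```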
We still don't know what was the root cause for the problem we had.