Slow aggregation queries, only after data change (ES 2.3)


(Yoni Dor) #1

Hi.

We are trying to analyze an issue we have, where we occasionally get slow responses for a query that is usually quick.

Our queries aggregate on a field, entityId, which is a not-analyzed string value.
We run a query which executes a terms aggregation on that field (entityId), with a specified size.
We have noticed that while this query usually returns a response in ~10ms, each time we write to the index (indexing a new document, re-indexing an existing one, or deleting a doc), the next 2 queries are much slower, around 400ms. When profiling the queries we saw that those 2 slow queries return responses from two different sets of shards, probably split between primary shards and replicas.
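
For readers following along, here is a minimal sketch of the kind of request body described above, built as a plain Python dict. The aggregation name `by_entity`, the default size of 10, and the top-level `"size": 0` (which skips returning hits) are assumptions for illustration; the original post does not show the actual query.

```python
def terms_agg_body(field="entityId", size=10):
    """Build a search body that runs a terms aggregation on `field`.

    `by_entity` is a hypothetical aggregation name, and the top-level
    "size": 0 is an assumption: it asks for no hits, only buckets.
    """
    return {
        "size": 0,
        "aggs": {
            "by_entity": {
                "terms": {"field": field, "size": size}
            }
        },
    }
```

The resulting dict can be serialized with `json.dumps` and sent to the index's `_search` endpoint with whatever HTTP client is at hand.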

We suspect that the write operation causes the index to rebuild the data it needs in order to perform the aggregation, but we don't know why that would happen.

Our cluster consists of 3 nodes, with 6 shards (+6 replicas), running on ES 2.3

Your help would be appreciated,
Yoni


(Ed) #2

Well, this gets into some really core troubleshooting.

The 10ms is probably dealing with cached data. For example, after you restart everything and issue a query, how long does it take to run the query one time? That would be the "uncached performance". Then run it again, and that would be the cached performance.
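
One way to capture the cold-versus-warm comparison is a small timing helper like the sketch below. It assumes you supply your own `run_query` callable that sends the search (that callable is hypothetical and not shown here); the helper only measures wall-clock time around each call.

```python
import time


def time_calls(run_query, repeats=3):
    """Call run_query() `repeats` times and return each latency in ms.

    The first entry approximates the uncached run, the later entries
    the cached ones. `run_query` is a caller-supplied, hypothetical
    function that sends the search request.
    """
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

Running it once right after a restart, or right after a write, and comparing the first latency with the rest should show whether the slowdown is a one-off warm-up cost.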

How big is your index? How many shards, and how much CPU and heap space?

When you're testing, is there any indexing going on, index rotation, or other people querying?

Next I would look at your IO during the time of the query. Run iotop and watch all your hosts to see if there is any massive spike. (For less than 1 second it will be tough to see; you could try sar and adjust it to collect very often, and then look at your performance.)

Then I would look at your system memory usage and heap

On Linux the OS tries to cache disk reads, which is very beneficial to ELK. If you have 0 free memory, how does the rest break down (used, cache, and buffers)? If there is nothing left for the cache, you could probably use some more memory, or, if your heap is not fully used, decrease the heap size.
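
As a rough illustration of that memory check, the sketch below parses `/proc/meminfo`-style text and separates free memory from page cache and buffers. The field names (`MemTotal`, `MemFree`, `Buffers`, `Cached`, `MemAvailable`) are standard Linux ones, but the function names and the summary layout are made up for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(text):
    """Summarize free memory versus page cache (hypothetical layout)."""
    info = parse_meminfo(text)
    return {
        "total_kb": info.get("MemTotal", 0),
        "free_kb": info.get("MemFree", 0),
        "page_cache_kb": info.get("Cached", 0) + info.get("Buffers", 0),
        # MemAvailable is missing on older kernels, hence None.
        "available_kb": info.get("MemAvailable"),
    }
```

On a Linux host you would pass in the contents of `/proc/meminfo`. A low `free_kb` alongside a large `page_cache_kb` is normal and usually healthy for Elasticsearch.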

Then you can get into tuning Elastic.
You can change the percentage of heap used for caching (this may be a good start):

https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html


(Ed) #3

Wow 3 nodes, with 6 shards (+6 replicas),

Why do you have 6 replicas? At most I would only have 2 replicas (1 node = original, 2 more = 1 for each other node).


(Yoni Dor) #4

I was misunderstood :slight_smile:
We have 12 shards in total, 6 original, and 6 are the replicas for each one of these.


(Ed) #5

OK, yeah, I sort of thought about that for a while. Yeah, replication of 1 :smiley: Let me know how the suggestions work out.


(Yoni Dor) #6

The index holds 24,695,454 docs that take 4GB.
Total memory on the machine is 12GB, the heap is configured to take up to 6GB, and on all nodes the heap is 7 to 10 percent used.

The index is at rest when I test. It is an isolated environment, so I know that I'm the only one querying it, and every write operation I make is followed by the slow responses.

Do these numbers give you any more ideas? I currently don't know which way to look. I don't even understand why my index would need to rebuild some cache construct when I'm using doc values, which are built iteratively at index time.


(Ed) #7

It helps a little, but you have to look at the OS/hardware at this point.

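24.7M records / 6 primary shards = roughly 4.1 million records in each shard being searched, and with replicas that is 12 shard copies, or about 4 per node if they are spread evenly over the 3 nodes.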
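
A quick sketch of the shard arithmetic for the cluster described in this thread (24,695,454 docs, 6 primaries, 1 replica each, 3 nodes). It assumes documents and shard copies are spread evenly, which the thread does not confirm; the function and key names are hypothetical.

```python
def shard_layout(total_docs, primaries, replicas_per_primary, nodes):
    """Estimate per-shard and per-node load, assuming an even spread."""
    shard_copies = primaries * (1 + replicas_per_primary)
    return {
        "docs_per_shard": total_docs / primaries,
        "shard_copies": shard_copies,
        "copies_per_node": shard_copies / nodes,
        "docs_per_node": total_docs * (1 + replicas_per_primary) / nodes,
    }


layout = shard_layout(24_695_454, primaries=6, replicas_per_primary=1, nodes=3)
```

Under those assumptions each shard holds about 4.1 million docs, and a single search touches one copy of each of the 6 shards.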

So you will probably have some load average and IO on each system that you should check, as that will have the most impact on searches.

The other is what your actual search is (can you provide an example?), but in general, using lots of * or querying the _all field will slow you down.

Try reading this document before we go much further


(Yoni Dor) #8

Hi,

We have finally managed to handle the issue by changing the terms aggregation into a filters aggregation (@eperry, which I found in the post you referred me to, 10x).
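
The exact replacement query is not shown in the thread, so the following is only a sketch of what a filters aggregation over entityId could look like. It assumes the set of entity IDs of interest is known up front (a filters aggregation needs one named filter per bucket), and the aggregation name `by_entity` is hypothetical.

```python
def filters_agg_body(entity_ids, field="entityId"):
    """Build a search body with one named term filter per entity ID.

    Assumes the caller already knows which IDs to bucket on;
    `by_entity` is a hypothetical aggregation name.
    """
    return {
        "size": 0,  # assumption: only bucket counts are needed, no hits
        "aggs": {
            "by_entity": {
                "filters": {
                    "filters": {
                        eid: {"term": {field: eid}} for eid in entity_ids
                    }
                }
            }
        },
    }
```

Each bucket in the response is keyed by the entity ID used as the filter name.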

We still don't know what the root cause of the problem was.

Thanks for all your help


(Ed) #9

Well glad you got a handle on it. :slight_smile:

I find that I am slowly converting most of my queries to filters. Seems almost silly not to cache queries too if there is spare HEAP

