Indexing slowing down aggregations a lot

We are running a search engine with an index of around 6M documents (~100GB) on a 3-node i3.xlarge managed AWS cluster. Sharding and replicas are at the default settings, so 5 primary shards with one replica each. We are on ES version 6.3.1. The index is constantly updated by a crawler, which performs roughly 1500 creates, updates and deletes per minute (all implemented with bulk requests).
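
For context, the crawler's bulk requests look roughly like this; the index name, type, IDs and fields below are just placeholders, not our real mapping:

POST /_bulk
{ "index": { "_index": "documents", "_type": "_doc", "_id": "1" } }
{ "title": "some page title", "body": "crawled page text" }
{ "update": { "_index": "documents", "_type": "_doc", "_id": "2" } }
{ "doc": { "title": "updated page title" } }
{ "delete": { "_index": "documents", "_type": "_doc", "_id": "3" } }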

As part of our autocomplete system we run (simple) terms aggregations on edge n-grammed fields (pretty much the last example here, combined with a terms aggregation; roughly the sketch below).
We are aware that this is not the fastest implementation option for autocomplete, but we want to be able to handle a lot of specific contexts (so no completion suggester), and the response does not need to be ultra fast.
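
A simplified sketch of the setup, with made-up index, analyzer and field names (the real mapping has more fields and contexts):

PUT /documents
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": { "type": "edge_ngram", "min_gram": 1, "max_gram": 20 }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard",
          "fields": { "keyword": { "type": "keyword" } }
        }
      }
    }
  }
}

GET /documents/_search
{
  "size": 0,
  "query": { "match": { "title": "que" } },
  "aggs": {
    "suggestions": { "terms": { "field": "title.keyword", "size": 10 } }
  }
}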

So with that said: running the constant indexing in the background more than doubles the average query time of the aggregation query (from roughly 500ms to 1000ms). However, it feels inconsistent; even with caching disabled, every once in a while there is a fast response, almost as if a background process is blocking the aggregation.

What can we do to increase the performance of this aggregation while still keeping the indexing running?

We have tried the recommended "one shard per node" approach, with one primary and two replica shards (each on a different i3.xlarge node), on a smaller 30GB test index. Unfortunately, this only made the performance worse (almost twice as slow).
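
Concretely, the test index was created with settings along these lines (index name is a placeholder):

PUT /test-index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 2
    }
  }
}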

Do we just throw more hardware at it? Is there a way to make ES always prioritize search requests over index/update/delete requests, maybe by adding even more replicas?
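
If extra replicas are the way to go, I assume we would just bump the setting on the live index, something like (index name again a placeholder):

PUT /documents/_settings
{
  "index": { "number_of_replicas": 2 }
}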

Thanks for any help and please ask for more details and clarification if needed.

What does CPU usage look like on the nodes? Is there anything in the logs around GC being slow or frequent?
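
If you don't have monitoring in place, the cat nodes and node stats APIs give a quick overview, e.g.:

GET _cat/nodes?v&h=name,cpu,load_1m,heap.percent,heap.max
GET _nodes/stats/os,jvm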

Hi Christian!
Adding more details about the autocomplete issue we are facing with Gerben :slight_smile:

Besides the sharding changes we also tried different instance families in our ES cluster, but no luck.
Below is a screenshot from Grafana showing CPU usage and GC information:

And in the application logs we got these GC warnings:

[2018-11-06T04:45:18,658][WARN ][o.e.m.j.JvmGcMonitorService] [UfBO8nG] [gc][young][414178][1443] duration [3s], collections [1]/[3.5s], total [3s]/[32.8s], memory [309.2mb]->[250.7mb]/[1015.6mb], all_pools {[young] [65.3mb]->[244.2kb]/[66.5mb]}{[survivor] [7.8mb]->[8.3mb]/[8.3mb]}{[old] [236.1mb]->[242.2mb]/[940.8mb]}
[2018-11-06T04:45:18,658][WARN ][o.e.m.j.JvmGcMonitorService] [UfBO8nG] [gc][414178] overhead, spent [3s] collecting in the last [3.5s]
[2018-11-06T04:51:22,904][WARN ][o.e.m.j.JvmGcMonitorService] [UfBO8nG] [gc][young][414535][1444] duration [7.2s], collections [1]/[8.1s], total [7.2s]/[40s], memory [313.7mb]->[257.8mb]/[1015.6mb], all_pools {[young] [63.2mb]->[579.2kb]/[66.5mb]}{[survivor] [8.3mb]->[6.7mb]/[8.3mb]}{[old] [242.2mb]->[250.5mb]/[940.8mb]}
[2018-11-06T04:51:22,904][WARN ][o.e.m.j.JvmGcMonitorService] [UfBO8nG] [gc][414535] overhead, spent [7.2s] collecting in the last [8.1s]
...
[2018-11-12T04:22:15,579][WARN ][o.e.m.j.JvmGcMonitorService] [CDiRrfk] [gc][453860] overhead, spent [535ms] collecting in the last [1s]

Thanks a lot!

It seems like you need more heap for that workload, so a larger instance type may help.
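
For reference, the GC lines above report a total heap of only about 1GB ([1015.6mb]), which matches the out-of-the-box default. On a self-managed node you would raise it in config/jvm.options; on a managed cluster the heap generally follows the instance size. A minimal sketch, assuming jvm.options is editable and treating 8g purely as an example value:

# config/jvm.options (example values; keep Xms and Xmx equal and below ~50% of the machine's RAM)
-Xms8g
-Xmx8g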

Maybe a bit off-topic, but take a look at using highlighting as a possible alternative to your aggregations for autocomplete :slight_smile: Depending on your data size and load this could give better performance and simpler results.
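
Roughly what I mean, reusing the field names from the sketch earlier in the thread (query string and size are arbitrary):

GET /documents/_search
{
  "size": 10,
  "query": { "match": { "title": "que" } },
  "highlight": {
    "fields": { "title": {} }
  }
}

You would then collect the highlighted fragments client-side instead of reading the aggregation buckets.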
