The aggregations were the cause of the poor performance.
I was even thinking about adding two more aggs, but that wasn't an option with performance this poor.
I ended up splitting each query in two: one that returns the docs, and one that returns only the aggregations and no docs. The latter can be cached.
This, together with custom app-level caching for the most costly and frequent aggs, has improved performance significantly: only 10% of queries take longer than 0.6s now.
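In case it helps anyone else, here is a rough sketch of the idea in Python. It is not my actual code. It assumes Elasticsearch-style request bodies with "query", "size" and "aggs" keys, and the names split_search_body and TTLCache are made up for the example. Setting "size" to 0 on the aggs-only request is what makes it eligible for the shard request cache, which by default only caches requests that return no hits.

```python
import json
import time


def split_search_body(body):
    """Split one search body into a docs-only body and an aggs-only body."""
    # Docs request: everything except the aggregations.
    docs_body = {k: v for k, v in body.items() if k not in ("aggs", "aggregations")}
    # Aggs request: same query, no hits returned (size 0), aggregations kept.
    aggs_body = {"size": 0}
    for key in ("query", "aggs", "aggregations"):
        if key in body:
            aggs_body[key] = body[key]
    return docs_body, aggs_body


class TTLCache:
    """Tiny in-process cache keyed by the serialized request body (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get_or_compute(self, body, compute):
        # sort_keys makes the key independent of dict ordering.
        key = json.dumps(body, sort_keys=True)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute(body)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Here compute would be whatever function actually sends the aggs-only body to the cluster. The app-level cache sits in front of that call, so the most frequent and expensive aggs never reach the cluster until their entry expires.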
Thanks!