Understanding execution time and memory usage of a terms aggregation?

Hello everyone.

The cluster has 4 nodes (EC2 r3.2xlarge), each with 60 GB of memory and 30 GB allocated to the Java heap.
We're doing a terms aggregation on a field that has 20M terms (9M distinct) spread across 19M documents.
The index has 5 shards and 1 replica.
The query we're performing filters down to 1M documents and executes the aggregation:

 attribute :urls, Array[String], mapping: { index: :not_analyzed, doc_values: true }

{"destination_urls"=>{
   "terms"=>{"field"=>"urls", "size"=>100}
}

We tested the query by running many of them in sequence: a query takes up to 6 seconds, and approximately every 5th query takes up to 30 seconds. We believe those worst cases happen because the nodes are garbage collecting.
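
A minimal way to confirm (or rule out) the GC theory, as far as we understand the nodes stats API, is to watch the old-generation collectors between a fast and a slow run (the host below is a placeholder):

  # old-generation GC counters from the JVM section of the nodes stats API
  curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep -A 3 '"old"'

If collection_count and collection_time_in_millis jump together with the 30-second responses, that would support the GC explanation.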

Since we're having a hard time understanding what exactly is going on and what the limits are, I'll write down what we think is happening. We'd be grateful if someone could confirm we got it right or tell us what we got wrong.

  1. terms agg gets executed on every shard
  2. field_data is read from disk and traversed
  3. for each encountered term a bucket is generated
  4. each doc associated with the term is matched against the filter, and if it matches, the bucket's doc_count is incremented
  5. a limited number of buckets (100) is returned to the client node performing the query
  6. the client node merges the partial results and computes the final list
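
For reference, this is roughly the full request we have in mind (the index name and the filter below are placeholders). We spelled out shard_size only to make step 5 explicit: if we read the docs correctly, each shard returns its top shard_size buckets (which defaults to more than size) and the client node then merges them down to the final 100.

  # placeholder index name and filter; shard_size made explicit for illustration
  curl -s 'localhost:9200/my_index/_search?pretty' -d '{
    "size": 0,
    "query": {
      "filtered": {
        "filter": {"term": {"campaign_id": 42}}
      }
    },
    "aggs": {
      "destination_urls": {
        "terms": {"field": "urls", "size": 100, "shard_size": 500}
      }
    }
  }'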

Our understanding of disk usage:

  1. Since field data is loaded from disk, disk performance has an important impact on execution time.
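
Since the mapping sets doc_values: true, we assume the field data being read here is the on-disk doc values rather than the heap-resident fielddata cache. A quick way to sanity-check that assumption is the fielddata section of the nodes stats, which should stay close to zero for the urls field:

  # heap-resident field data for the urls field; should be near zero if doc values are in use
  curl -s 'localhost:9200/_nodes/stats/indices/fielddata?fields=urls&pretty'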

Our understanding of memory usage:

  1. Memory is used to load the field data that needs to be traversed, but this is bounded.
  2. Memory is used by the buckets that need to exist on every shard.
  3. Memory on the client node is used to merge the results.
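
Related to points 1 and 2, if we understand the circuit breakers correctly they give a per-node view of this memory: the fielddata breaker tracks loaded field data and the request breaker tracks per-request structures such as aggregation buckets (we're not sure how precisely buckets are accounted for there). A quick check:

  # per-node circuit breaker usage (fielddata, request, parent)
  curl -s 'localhost:9200/_nodes/stats/breaker?pretty'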

Memory related assumptions:

  1. By increasing the number of shards, the field data per shard will decrease but increase in total.
  2. By increasing the number of shards, the number of buckets per shard will decrease but increase in total.
  3. Memory needed on the client node to merge the results will stay the same.

Our understanding of CPU usage:

  1. CPU time is spent sorting the buckets on every shard.
  2. CPU time is spent merging the results on the client node.

CPU related assumptions:

  1. By increasing the number of shards, we reduce the size of each per-shard sorted set.
  2. We parallelise the processing and reduce overall time.
  3. We increase the time spent merging the results, but we think this won't have a big impact.
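
One knob we are unsure about in this context is the terms aggregation's execution_hint. If we read the docs correctly, the map hint builds buckets only from terms actually seen in the matching (1M) documents instead of working over ordinals for all 9M distinct terms, which trades CPU and memory differently; we don't know whether it would help or hurt here:

  # same aggregation with an execution hint; placeholder index name, filter omitted
  curl -s 'localhost:9200/my_index/_search?pretty' -d '{
    "size": 0,
    "aggs": {
      "destination_urls": {
        "terms": {"field": "urls", "size": 100, "execution_hint": "map"}
      }
    }
  }'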

Questions:

How would you speed this up?