Optimizing term aggregations for multivalued fields

I have an ES 2.3.3 index of about 300M documents / 60 primary shards, on a pretty big cluster (8x r3.2xlarge). One of the fields is multivalued (array field) that usually has 5-20 values per document and a cardinality of about 25,000. It is not_analyzed. I'm not storing the source of the documents at all because it's not my main datastore and the primary purpose right now is analytics/aggregations.

I'm trying to optimize term aggregations on this field when there are no filters, so it has to hit all 300M documents ("summarize the whole dataset" type queries). However, these queries are still a bit slow (5-8 sec), especially if there are any sub-aggregations (10-30 sec).

I am using caching, but it takes 5 or so repetitions of the same query before it starts hitting cache. I've also turned on "collect_mode": "breadth_first" if term or date_histogram sub-aggregations are involved, which helps.

The only other thing I've found that really seems to work is increasing the shard count (and my AWS budget...) But are there any other tricks for speeding up these kinds of summary queries?

Any ideas welcome. Thanks.