Elasticsearch Aggregations taking a long time

I understand this is one of the most frequently asked topics in Elasticsearch. I have seen a lot of answers but haven't found anything convincing. Here is the problem:

I have a monthly index with 5 primary shards, each with 1 replica, on 5 data nodes.

Hardware:
8 CPUs, 32 GB of RAM and a 16 GB heap. The fielddata circuit breaker is set to 30% and indices.breaker.total.limit to 70%.

Each of these indices holds around 100 million documents. Each document is about 150k in size. All of the fields are keyword analyzed.

A simple term aggregation on one of the fields takes around 60s, and this grows with the data in the index. I further reduced the set on which the aggregation runs by using a filter aggregation. My query is included below.

What I do not understand is this: running the query with just the filter aggregation filter_agg takes ~1s and returns 157 documents, but adding term_aggregate causes the query to take more than 100s.

  • Am I missing something here? Is there something wrong with the query?
  • Does term_aggregate aggregate only the 157 documents that resulted from filter_agg?

GET /index-1-2016/type/_search?search_type=count
{
  "aggs": {
    "filter_agg": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "search_field1": "field1",
                "_cache": true
              }
            },
            {
              "term": {
                "search_field2": "field2",
                "_cache": true
              }
            }
          ]
        }
      },
      "aggs": {
        "term_aggregate": {
          "terms": {
            "field": "emails",
            "size": 5,
            "shard_size": 50
          }
        }
      }
    }
  }
}
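
On the second question: as far as I know, a sub-aggregation only sees the documents that fall into its parent bucket, so term_aggregate should count values over the 157 filtered documents and nothing else. Here is a toy pure-Python model of that scoping. It is not Elasticsearch code; the function name and the sample documents are made up for illustration.

```python
from collections import Counter


def filter_then_terms(docs, must_terms, field, size):
    """Toy model of a filter aggregation with a nested terms aggregation."""
    # Step 1: the filter bucket keeps only documents matching every term clause.
    bucket = [
        doc for doc in docs
        if all(doc.get(name) == value for name, value in must_terms.items())
    ]
    # Step 2: the terms sub-aggregation counts documents per value,
    # looking only at the documents inside that bucket.
    counts = Counter()
    for doc in bucket:
        for value in set(doc.get(field, [])):  # a doc counts once per distinct value
            counts[value] += 1
    # Order by document count (descending), then by term, and keep the top `size`.
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:size]
    return len(bucket), top


# Hypothetical sample documents, not taken from the real index.
docs = [
    {"search_field1": "field1", "search_field2": "field2",
     "emails": ["a@example.com", "b@example.com"]},
    {"search_field1": "field1", "search_field2": "field2",
     "emails": ["a@example.com"]},
    {"search_field1": "field1", "search_field2": "other",
     "emails": ["a@example.com", "c@example.com"]},
    {"search_field1": "other", "search_field2": "field2",
     "emails": ["c@example.com"]},
]

doc_count, top = filter_then_terms(
    docs, {"search_field1": "field1", "search_field2": "field2"}, "emails", 5
)
print(doc_count, top)
```

If that model is right, the extra time is probably not spent counting 157 documents. Loading fielddata for the emails field across the whole shard would be a more likely suspect.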

A simple term aggregation on one of the fields takes around 60s

I think you should start from here. The first query with a term aggregation on a "keyword" analyzed field takes time, because each shard needs to populate the fielddata for that field. Still, 60s seems quite long. What do you mean by "keyword" analyzed? Did you use the keyword analyzer in the definition of the field?
What is the content of your field? Is it big?
What is the response time if you run the query several times?

My mapping for the field looks like this:
"analysis": {
  "analyzer": {
    "lowercase_keyword_analyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": [
        "lowercase"
      ]
    }
  }
}

...
"mappings": {
  "type": {
    "properties": {
      ...
      "emails": {
        "type": "string",
        "analyzer": "lowercase_keyword_analyzer"
      }
      ...
    }
  }
}

It is just an array of emails. Does it matter that it is an array?

The response time appears to be about the same on average. It doesn't decrease with subsequent invocations.

OK, thank you for the clarifications. Why are you using a keyword tokenizer? Are you trying to find duplicates among the emails? The keyword tokenizer emits the entire input as a single token, which means that each email in your aggregation counts as one entry. I suspect that the size of those tokens is problematic and is the reason why it's taking so much time. Can you describe your use case?
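
To make the point about the keyword tokenizer concrete, here is a rough pure-Python imitation of what lowercase_keyword_analyzer does to an array of emails. It only mimics the behaviour described above (whole value as one token, then lowercased) and does not call Elasticsearch; the helper index_terms and the sample addresses are invented for the example.

```python
def lowercase_keyword_analyzer(value):
    """Imitate a keyword tokenizer followed by a lowercase filter."""
    # The keyword tokenizer does not split the input: the whole value is one token.
    # The lowercase filter then lowercases that single token.
    return [value.lower()]


def index_terms(values):
    """Analyze each element of a multi-valued field separately."""
    terms = []
    for value in values:
        terms.extend(lowercase_keyword_analyzer(value))
    return terms


# Hypothetical field content: an array of email addresses.
emails = ["John.Doe@Example.com", "jane@example.com"]
print(index_terms(emails))  # ['john.doe@example.com', 'jane@example.com']
```

Each array element ends up as exactly one term, so the aggregation buckets on whole lowercased addresses.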

The field contains an array of email IDs. I want them to be searchable as well. Do you think making it a multi-field, with an additional not_analyzed sub-field, would improve performance?
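
For reference, the multi-field idea could look roughly like the sketch below, written as Python dicts that serialize to the JSON bodies. The sub-field name raw is an arbitrary choice, and the doc_values setting is an assumption that should be checked against the documentation for the version in use. Whether this actually speeds up the aggregation is untested here.

```python
import json

# Sketch of the emails mapping with an extra not_analyzed sub-field.
# "raw" is a hypothetical sub-field name; "doc_values" is an assumption to verify.
emails_mapping = {
    "emails": {
        "type": "string",
        "analyzer": "lowercase_keyword_analyzer",  # existing analyzer, kept for search
        "fields": {
            "raw": {
                "type": "string",
                "index": "not_analyzed",  # value is indexed as-is, not lowercased
                "doc_values": True,       # assumed option: on-disk columnar values
            }
        },
    }
}

# The terms aggregation would then target the sub-field instead of "emails".
terms_agg = {
    "term_aggregate": {
        "terms": {"field": "emails.raw", "size": 5, "shard_size": 50}
    }
}

print(json.dumps(emails_mapping, indent=2))
print(json.dumps(terms_agg, indent=2))
```

Two caveats: a not_analyzed sub-field is not lowercased, so addresses that differ only in case would land in different buckets, and documents indexed before the mapping change would need to be reindexed before the sub-field has values.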