Elasticsearch Aggregations taking a long time


(photonic_world) #1

I understand this is one of the most discussed topics in Elasticsearch. I have seen a lot of answers but haven't found anything convincing. Here is the problem:

I have a monthly index with 5 primary shards and 1 replica of each, on 5 data nodes.

Hardware:
8 CPUs, 32 GB RAM, and 16 GB of heap. The fielddata circuit breaker is set to 30% and indices.breaker.total.limit is at 70%.

These indices hold around 100 million documents, each about 150k in size. All of the fields are keyword analyzed.

A simple terms aggregation on one of the fields takes around 60s, and this grows with the data in the index. I further reduced the set on which the aggregation runs by using a filter aggregation; my query is at the end of this post.

What I do not understand is that running this query with just the filter aggregation filter_agg takes ~1s and returns 157 documents, while adding term_aggregate causes the query to take more than 100s.

  • Am I missing something here, is there something wrong with the query?
  • Does term_aggregate aggregate only the 157 documents that resulted from filter_agg?

GET /index-1-2016/type/_search?search_type=count
{
  "aggs": {
    "filter_agg": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "search_field1": "field1",
                "_cache": true
              }
            },
            {
              "term": {
                "search_field2": "field2",
                "_cache": true
              }
            }
          ]
        }
      },
      "aggs": {
        "term_aggregate": {
          "terms": {
            "field": "emails",
            "size": 5,
            "shard_size": 50
          }
        }
      }
    }
  }
}
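
As a point of comparison, here is a minimal sketch of the same two term filters applied as a top-level query instead of a filter aggregation, so that the terms aggregation only ever sees matching documents. It assumes the constant_score query with a filter clause is available on this cluster's version, and it reuses the field names from the query above. The leading # line is a console-style comment and should be dropped if the request is sent with curl. It is meant for timing comparison only, not as a confirmed fix.

```
# Sketch only: same filters moved into the query part; not verified against this cluster
GET /index-1-2016/type/_search?search_type=count
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "search_field1": "field1" } },
            { "term": { "search_field2": "field2" } }
          ]
        }
      }
    }
  },
  "aggs": {
    "term_aggregate": {
      "terms": { "field": "emails", "size": 5, "shard_size": 50 }
    }
  }
}
```

If this form is just as slow as the filter aggregation version, that would suggest the cost lies in loading fielddata for emails rather than in how the documents are filtered.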


(Jimferenczi) #2

A simple term aggregation on one of the fields takes around 60s

I think you should start from here. The first query with a terms aggregation on a "keyword" analyzed field takes time, because each shard needs to populate the fielddata for this particular field. Still, 60s seems quite long. What do you mean by "keyword" analyzed? Did you use the keyword analyzer in the definition of the field?
What is the content of your field? Is it big?
What is the response time if you run the query several times?


(photonic_world) #3

My mapping for the field looks like this:
"analysis": {
  "analyzer": {
    "lowercase_keyword_analyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": [
        "lowercase"
      ]
    }
  }
}

...
"mappings": {
  "type": {
    "properties": {
      ...
      "emails": {
        "type": "string",
        "analyzer": "lowercase_keyword_analyzer"
      }
      ...
    }
  }
}

It is just an array of emails. Does it matter that it is an array?

Response time appears to be the same on average. It doesn't decrease with subsequent invocations.


(Jimferenczi) #4

OK, thank you for the clarifications. Why are you using a keyword tokenizer? Are you trying to find duplicates in the mails? The keyword tokenizer "tokenizes" an entire stream as a single token, which means that each mail in your aggregation counts as one entry. I suspect that the size of those tokens is problematic and is the reason why it's taking so much time. Can you describe your use case?


(photonic_world) #5

The field contains an array of email IDs. I want them to be searchable as well. Do you think adding another field and making it a multi-field with one not_analyzed sub-field would improve performance?
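
To make the question concrete, this is a rough sketch of the multi-field mapping I have in mind. It assumes the string type's "fields" syntax, and the sub-field name "raw" is hypothetical (any name would do). Doc values are enabled explicitly on the not_analyzed sub-field on the assumption that this lets the aggregation read from disk-based doc values instead of building fielddata on the heap.

```
"emails": {
  "type": "string",
  "analyzer": "lowercase_keyword_analyzer",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}
```

The terms aggregation would then point at "emails.raw" instead of "emails", while searches keep using the lowercased "emails" field. Values in "emails.raw" would not be lowercased, and existing documents would presumably need to be reindexed before the sub-field is populated.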

