I understand this is one of the most frequently asked topics in Elasticsearch. I have seen a lot of answers but haven't found anything convincing. Here is the problem:
I have a monthly index with 5 primary shards and 1 replica of each, on 5 data nodes.
Hardware:
8 CPUs, 32 GB of RAM, and 16 GB of heap. The fielddata circuit breaker is set at 30% and indices.breaker.total.limit is at 70%.
Each of these indices holds around 100 million documents, and each document is about 150k in size. All of the fields are keyword analyzed.
A simple terms aggregation on one of the fields takes around 60s, and this grows with the amount of data in the index. I further reduced the set of documents on which the aggregation runs by using a filter aggregation. Here is my query:
What I do not understand is that running this query with just the filter aggregation filter_agg takes ~1s and returns 157 documents, but adding term_aggregate causes the query to take more than 100s.
Am I missing something here? Is there something wrong with the query?
Does term_aggregate aggregate only the 157 documents that resulted from filter_agg?
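For reference, here is a rough sketch of the query shape being described: a terms aggregation nested under a filter aggregation. The names filter_agg and term_aggregate come from the post; the filter clause and the field names ("status", "recipients") are made up for illustration, since the actual query is not shown. When a sub-aggregation is nested like this, it only sees the documents that fall into the parent filter bucket.

```python
import json


def build_filtered_terms_query(filter_clause, terms_field, size=10):
    """Build a request body with a terms aggregation nested under a filter aggregation.

    The filter clause and field name passed in are hypothetical placeholders.
    """
    return {
        "size": 0,  # skip search hits, return aggregation results only
        "aggs": {
            "filter_agg": {
                "filter": filter_clause,
                "aggs": {
                    # Nested here, term_aggregate runs only over documents
                    # that matched filter_agg.
                    "term_aggregate": {
                        "terms": {"field": terms_field, "size": size}
                    }
                },
            }
        },
    }


# Hypothetical filter and field name, for illustration only.
query = build_filtered_terms_query({"term": {"status": "bounced"}}, "recipients")
print(json.dumps(query, indent=2))
```

If term_aggregate is instead placed next to filter_agg at the top level of "aggs", it runs over every document matched by the query, not just the filtered set.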
A simple term aggregation on one of the fields takes around 60s
I think you should start from here. The first query with a terms aggregation on a "keyword" analyzed field takes time, because each shard needs to populate the fielddata for this particular field. Even so, 60s seems quite long. What do you mean by "keyword" analyzed? Did you use the keyword analyzer in the definition of the field?
What is the content of your field? Is it big?
What is the response time if you run the query several times?
My mapping for the field looks like this:
"analysis":{
  "analyzer":{
    "lowercase_keyword_analyzer":{
      "type":"custom",
      "tokenizer":"keyword",
      "filter":[
        "lowercase"
      ]
    }
  }
}
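To make the effect of this analyzer concrete, here is a small Python emulation (not Lucene code, and only an approximation of its lowercasing rules). It assumes the field holds either a single string or an array of strings: the keyword tokenizer emits each value as one token, and the lowercase filter then lowercases it. The sample addresses are invented.

```python
def lowercase_keyword_analyze(value):
    """Approximate a keyword tokenizer followed by a lowercase filter.

    Each string value becomes exactly one token (no splitting on whitespace
    or punctuation), and that token is lowercased.
    """
    values = value if isinstance(value, list) else [value]
    return [v.lower() for v in values]


# Hypothetical field content: an array of email addresses.
tokens = lowercase_keyword_analyze(["Alice@Example.com", "BOB@example.COM"])
print(tokens)  # ['alice@example.com', 'bob@example.com']
```

So the number of distinct terms a terms aggregation has to handle is the number of distinct lowercased values, and long values produce long terms.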
OK, thank you for the clarifications. Why are you using a keyword tokenizer? Are you trying to find duplication in the mails? The keyword tokenizer emits the entire input as a single token, which means that each mail in your aggregation counts as one entry. I suspect that the size of those tokens is problematic and is the reason why it's taking so much time. Can you describe your use case?
The field contains an array of email IDs, and I want them to be searchable as well. Do you think adding another field and making it a multi-field with one not_analyzed sub-field would improve performance?
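A minimal sketch of what such a multi-field mapping could look like, assuming the pre-5.x "string" type that the not_analyzed setting implies. The field name "recipients" and the sub-field name "raw" are hypothetical; lowercase_keyword_analyzer is the analyzer defined earlier in the thread.

```python
import json


def build_multifield_mapping(field_name):
    """Build a mapping with an analyzed main field and a not_analyzed sub-field.

    Assumes the legacy "string" field type; field and sub-field names are
    placeholders chosen for illustration.
    """
    return {
        "properties": {
            field_name: {
                "type": "string",
                # Main field stays searchable through the custom analyzer.
                "analyzer": "lowercase_keyword_analyzer",
                "fields": {
                    # Sub-field indexed verbatim, intended for aggregations.
                    "raw": {"type": "string", "index": "not_analyzed"}
                },
            }
        }
    }


mapping = build_multifield_mapping("recipients")
print(json.dumps(mapping, indent=2))
```

The terms aggregation would then target "recipients.raw" instead of "recipients". Whether this helps depends on the version in use: as far as I know, not_analyzed string fields can use doc values (on by default since 2.0, opt-in via "doc_values": true before that) while analyzed string fields have to go through fielddata on the heap. Note that the raw sub-field is not lowercased, so values differing only in case would land in separate buckets.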