Greetings,
Any advice on how to improve query speed for a specific use case would be greatly appreciated!
We currently have a cluster of 2 nodes (one master+data node, one data-only node), both running on AWS machines with the data folder on the EBS volume type recommended for this kind of task. Each machine has 8 vCPUs and 61 GB of RAM, 30 GB of which is given to the ES heap. The number of shards is currently set to 30 with 1 replica, since we expect to grow to at least 30 GB per shard in the upcoming months.
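For reference, the index was created with settings along these lines (the index name is a placeholder, not our real one):

```
PUT /our_index
{
  "settings": {
    "number_of_shards": 30,
    "number_of_replicas": 1
  }
}
```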
Swapping has been reduced to close to none on both servers.
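In practice this means bootstrap.memory_lock: true in elasticsearch.yml (bootstrap.mlockall on versions before 5.0) plus a very low vm.swappiness at the OS level, and we check that the heap lock actually took effect with:

```
GET _nodes?filter_path=**.mlockall
```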
We currently have an index with approximately 500 GB of data across 50 million documents.
The size of the index is due to the following:
- We're using ngrams to find substrings inside bigger strings without resorting to double-wildcard queries, which take longer to process. Ngrams are applied to big fields (more than 100,000 characters); the field mapping itself is sketched after the analyzer config below.
- We're using keyword sub-fields on the same big fields to remove duplicates via aggregations, ignoring values longer than 8,000 characters (a sketch of that aggregation follows this list).
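Roughly, the dedup aggregation looks like this (big_field.raw is a placeholder for one of those keyword sub-fields; its mapping is sketched after the analyzer config below). Every bucket with two or more documents is a value that appears more than once:

```
GET /our_index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_values": {
      "terms": {
        "field": "big_field.raw",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
```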
The use case requires searching for text containing special characters and finding substrings inside bigger strings (e.g. finding me$s inside some$space), with highlighting. We're using the following analyzers:
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 5,
"max_gram": 8
}
},
"analyzer": {
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
},
"whitespace_ngram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"ngram_filter"
]
}
}
}
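These analyzers are attached to the big fields roughly like this (a trimmed sketch, not our exact mapping: field and type names are placeholders, and the text/keyword types assume 5.x; on 2.x these would be string fields with the keyword part set to not_analyzed):

```
"mappings": {
  "doc": {
    "_all": { "enabled": false },
    "properties": {
      "big_field": {
        "type": "text",
        "analyzer": "whitespace_analyzer",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "whitespace_ngram_analyzer"
          },
          "raw": {
            "type": "keyword",
            "ignore_above": 8000
          }
        }
      }
    }
  }
}
```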
The _all field has been disabled and we only search the fields we require.
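A typical query therefore hits only the ngram sub-field and asks for highlighting on it, something like this (using the me$s example from above; field names are again placeholders):

```
GET /our_index/_search
{
  "query": {
    "match": {
      "big_field.ngram": "me$s"
    }
  },
  "highlight": {
    "fields": {
      "big_field.ngram": {}
    }
  }
}
```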
Given this, we currently retrieve results with an average response time of around 3-4 seconds.
Is there anything else we could do to improve search speed, apart from adding more nodes to the cluster?
Any feedback is greatly appreciated!
Thanks!