First-time Elasticsearch query latency is too high on an index with 25 million documents.
Our end users are given free-text search and the ability to search by synonyms.
The problem, in short, is that first-time query operations are quite slow.
We understand that subsequent calls are served from caches and that the average time across requests is acceptable.
However, we want to understand the ROOT CAUSE and have been unable to identify it.
The setup consists of:
- 3 nodes, each with 4 cores and 16 GB RAM, of which 7 GB is allocated to the Elasticsearch heap.
- 800 GB SSD storage.
We have identified 2 possible approaches:
1. Configure synonyms only at index time.
   The main term and its synonyms are given equal weight using expand-all rules in the synonyms file, i.e. every term in a rule expands to every other. Index size: 60 GB.
2. Configure synonyms at index time + query time.
   This returns documents matching the exact input text first, followed by synonym matches, and is achieved with different synonym rule sets at index time and query time. Index size: 300 GB.
   (A sketch of both analyzer configurations follows this list.)
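For illustration, a minimal sketch of how the two approaches could be configured against the ES 2.4 REST API from Python. The index name, type name, field names, synonym entries and file path are illustrative, not our actual settings:

```python
import requests

ES = "http://localhost:9200"  # assumed local node; adjust to the cluster address

# Approach 2 shown in full: synonyms applied at index time AND (with a different
# rule set) at query time. Approach 1 is the same minus the query-side filter,
# i.e. drop "query_synonyms" and the "search_analyzer" lines.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "index_synonyms": {
                    "type": "synonym",
                    "synonyms": ["lung cancer, pulmonary carcinoma"]  # illustrative expand-all rule
                },
                "query_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/query_synonyms.txt"    # illustrative path under config/
                }
            },
            "analyzer": {
                "index_synonym_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "index_synonyms"]
                },
                "query_synonym_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "query_synonyms"]
                }
            }
        }
    },
    "mappings": {
        "document": {                       # ES 2.x still uses mapping types
            "properties": {
                "title": {
                    "type": "string",       # "string" (not "text") in 2.x
                    "analyzer": "index_synonym_analyzer",
                    "search_analyzer": "query_synonym_analyzer"
                },
                "description": {
                    "type": "string",
                    "analyzer": "index_synonym_analyzer",
                    "search_analyzer": "query_synonym_analyzer"
                }
            }
        }
    }
}

resp = requests.put(ES + "/articles", json=settings)  # "articles" is an illustrative index name
print(resp.json())
```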
Details:
No. of documents: 25 million
Document size: ~3 KB
Analyzed fields: title and description, matched via a synonym text analyzer
Not analyzed: around 38 fields (in the same index)
Nature of queries: mostly multi-word, e.g. "lung cancer and EGFR" or "lung cancer and gastric cancer" (see the sketch after this list)
Structure of documents: mostly flat; the fields above are at the first level
No. of replicas: 1
No. of shards: 6
First-time queries (not filters) take anywhere between 4 and 8 seconds.
Subsequent queries (cached, of course) return in under 200 ms on average.
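To make the query shape concrete, a rough sketch of the kind of first-time request involved; the index name and the exact query DSL are illustrative, our production query may differ:

```python
import requests

ES = "http://localhost:9200"

# Free-text, multi-word search over the two analyzed fields.
query = {
    "query": {
        "multi_match": {
            "query": "lung cancer and EGFR",
            "fields": ["title", "description"]
        }
    },
    "size": 10
}

resp = requests.post(ES + "/articles/_search", json=query)  # "articles" is illustrative
print("took (ms):", resp.json()["took"])                    # first run: 4000-8000, later runs: <200
```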
Questions:
What could be the root cause of this behaviour?
How should we go about troubleshooting it?
Is there any solution or alternative approach to this problem?
Observations so far:
- In the current setup, our cluster / node resources are NOT being used effectively.
  We have hardly any load, yet queries are slow; most of the heap is free and the index does not appear to be loaded into memory (no warmers in 2.4).
  We tested this by shutting down the other nodes and running queries against a single node, which returned results roughly 33% faster.
  Heap usage is just 7%.
  We are currently in the process of gathering more metrics; a sketch of the checks we have in mind follows this list.
- There may be an issue with our cluster setup and the communication between nodes.
  Our DevOps team will look into this ASAP.
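As a starting point for the metric gathering mentioned above, a sketch of the checks we have in mind, using APIs available in 2.4 (hot threads and node stats); the index name and query are illustrative:

```python
import requests
import time

ES = "http://localhost:9200"

# Run a cold (first-time) query and immediately capture what the nodes are doing:
# hot threads shows whether time is spent in disk reads, fielddata loading or merges,
# while node stats exposes fielddata / query-cache sizes and segment memory.
query = {
    "query": {
        "multi_match": {
            "query": "lung cancer and EGFR",
            "fields": ["title", "description"]
        }
    }
}

start = time.time()
search = requests.post(ES + "/articles/_search", json=query)  # "articles" is illustrative
elapsed = time.time() - start

hot_threads = requests.get(ES + "/_nodes/hot_threads")        # plain-text thread dump per node
node_stats = requests.get(ES + "/_nodes/stats/indices")       # caches, fielddata, segments, etc.

print("wall clock: %.2fs, took reported by ES: %s ms" % (elapsed, search.json().get("took")))
print(hot_threads.text[:2000])                                # top of the hot-threads dump
print(list(node_stats.json()["nodes"].keys()))                # node IDs to drill into further
```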
Index settings and mappings:
Further Update (22nd Dec, 2016):
- We tried setting fielddata loading to "eager" and found results to be 6-7 times faster, both on the cluster and on a single node (hardware upgraded from 16 GB to 32 GB RAM). The only drawback is that the JVM heap is stretched to its limits, which would happen anyway under sustained usage. The fielddata cache limit has been set to 70% of the allocated heap. Even then we see 8-10 GB of heap used out of 14 GB (on the 32 GB nodes). Is there any downside to this approach? (The relevant settings are sketched below.)
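For reference, a sketch of the settings involved; index, type and field names are illustrative, and the existing analyzer settings on the fields are left untouched by the mapping merge:

```python
import requests

ES = "http://localhost:9200"

# Eager fielddata loading for the analyzed string fields (ES 2.x mapping syntax).
mapping = {
    "properties": {
        "title":       {"type": "string", "fielddata": {"loading": "eager"}},
        "description": {"type": "string", "fielddata": {"loading": "eager"}}
    }
}

resp = requests.put(ES + "/articles/_mapping/document", json=mapping)  # illustrative names
print(resp.json())

# The 70% cap is a node-level setting in each node's elasticsearch.yml
# (static in 2.4, so it requires a node restart):
#
#   indices.fielddata.cache.size: 70%
```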