First-time queries (not filters) to Elasticsearch take a long time

First-time Elasticsearch query times are too high on a collection of 25 million documents.
Our end users have free-text search and the ability to search by synonyms.
The problem, in short, is that first-time query operations are very slow.
We understand that subsequent calls are served from caches and that the average time across requests is acceptable.
However, we want to understand the root cause, and so far we have been unable to identify it.

The setup consists of:

  • 3 nodes, each with 4 cores and 16 GB RAM, of which 7 GB is allocated to the Elasticsearch heap.
  • 800 GB SSD

We have identified 2 possible solutions (a configuration sketch follows the list):

1. Configure synonyms only at index time.
The main term and its synonyms are given equal weight via a synonym rules text file; in other words, every term is expanded to all of its synonyms. Index size: 60 GB.

2.Configure synonyms at index + query time
Thereby returning exact input text matched documents first, followed by synonyms
Achieved by different synonym rules at index and query time respectively. Index size 300GB.
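
For reference, this is roughly how the two variants are wired up in ES 2.x. It is only a minimal sketch: the index name, filter/analyzer names, and the synonyms file paths are illustrative, not our actual configuration.

    curl -XPUT 'localhost:9200/my_index' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "index_synonyms": {
              "type": "synonym",
              "synonyms_path": "analysis/index_synonyms.txt"
            },
            "query_synonyms": {
              "type": "synonym",
              "synonyms_path": "analysis/query_synonyms.txt"
            }
          },
          "analyzer": {
            "index_synonyms_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "index_synonyms"]
            },
            "query_synonyms_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "query_synonyms"]
            }
          }
        }
      },
      "mappings": {
        "article": {
          "properties": {
            "abstract": {
              "type": "string",
              "analyzer": "index_synonyms_analyzer",
              "search_analyzer": "query_synonyms_analyzer"
            }
          }
        }
      }
    }'

For solution 1, the search_analyzer line is dropped so the index-time synonym analyzer is reused at query time; for solution 2, the two analyzers reference different synonym rules as shown.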

Details:

  • Number of documents: 25 million
  • Document size: 3 KB
  • Analyzed fields: title and description, matched via a synonyms text analyzer
  • Not analyzed: around 38 fields (in the same index); see the mapping sketch below
  • Nature of queries: mostly multi-word, e.g. "lung cancer and EGFR" or "lung cancer and gastric cancer"
  • Document structure: mostly flat; the fields above are at the first level
  • Number of replicas: 1
  • Number of shards: 6
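
A stripped-down view of the analyzed vs. not-analyzed split in ES 2.x mapping terms. The field names are taken from the sample query later in this thread; the type name article and the exact set of fields shown are illustrative:

    {
      "mappings": {
        "article": {
          "properties": {
            "title":       { "type": "string", "analyzer": "index_synonyms_analyzer" },
            "description": { "type": "string", "analyzer": "index_synonyms_analyzer" },
            "pmid":        { "type": "string", "index": "not_analyzed" },
            "year_str":    { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }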

First-time queries (not filters) take anywhere between 4 and 8 seconds.
Subsequent queries (cached, of course) return in under 200 ms on average.
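
When benchmarking, we can reproduce the cold-cache behaviour on demand by clearing caches between runs; a minimal sketch, assuming an index named my_index:

    # clear the Elasticsearch-level caches (query/request/fielddata) for the index
    curl -XPOST 'localhost:9200/my_index/_cache/clear'

    # the OS page cache also affects cold reads; on Linux, as root on each node:
    sync && echo 3 > /proc/sys/vm/drop_caches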

Questions:
What may be the possible root cause of this behaviour?
How do we go about troubleshooting?
Are there any solution or alternate approach to this problem statement.

Observations so far:

  1. In the current setup, our cluster / node resources are NOT being used effectively.
    We have hardly any load, yet queries are slow; most of the heap is free and the index does not appear to be loaded into memory (and there are no warmers in 2.4).
    We tested this by shutting down all other nodes and running queries against a single node, which produced results about 33% faster.
    Heap usage is only 7%.
    We are currently in the process of gathering more metrics (see the commands after this list).

  2. There may be a challenge with our cluster setup and communication between nodes.
    Our DevOps team will look into this ASAP.
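
These are the stock APIs we are using to gather those metrics; a sketch only, nothing here is specific to our setup:

    # per-node heap and memory overview
    curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,load'

    # detailed JVM and index-level stats (cache sizes, segment memory, etc.)
    curl 'localhost:9200/_nodes/stats/jvm,indices?pretty'

    # see what the nodes are busy doing while a slow query runs
    curl 'localhost:9200/_nodes/hot_threads'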

Index settings and mappings:

Further Update (22nd Dec, 2016):

  • We tried setting fielddata loading to "eager" and found results to be 6-7 times faster, both on the cluster and on a single node (hardware upgraded from 16 GB to 32 GB RAM). The only drawback is that the JVM heap is stretched to its limits, which would be the case under consistent usage anyway. The fielddata cache limit has been set to 70% of the allocated heap; even then we see 8-10 GB of heap used out of 14 GB (on the 32 GB nodes). Is there any downside to this approach?
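
For context, this is roughly what that change looks like in ES 2.x; the field name year_str is just an example, and the cache limit is a node-level setting:

    # enable eager fielddata loading on a field in the mapping
    curl -XPUT 'localhost:9200/my_index/_mapping/article' -d '
    {
      "properties": {
        "year_str": {
          "type": "string",
          "index": "not_analyzed",
          "fielddata": { "loading": "eager" }
        }
      }
    }'

    # in elasticsearch.yml on every node: cap the fielddata cache
    indices.fielddata.cache.size: 70%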

Are we to take from this that the queries are performing aggregations or sorting on custom fields? Otherwise fielddata is not used in basic searches.
How many docs are you retrieving in your responses, and are you using highlighting?

We are doing basic searches (simple match queries) on a field containing the abstracts of research papers; it's a simple free-text search.

We retrieve 20 docs in our response.

We are not using highlighting.

A sample query is like this:
{ "_source":["year_str","pmid", "article_title"], "size": 20, "query":{ "bool": { "must": [ { "match": { "abstract": "gefitinib" } }, { "match": { "abstract": "paracetamol" } } ] } } }

OK so that query does not look like it would use fielddata at all.

Just how big are your docs?
And how many synonyms are your problem search terms being expanded to?

Each document is around 3 KB and contains around 40 key-value pairs. I am unable to post a sample document because of the limit on the number of characters.

The number of synonyms a search term is expanded to depends on the synonym rules; when I run terms through the _analyze endpoint, the result is about 20 tokens on average.
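
For example, this is the kind of check we run (the analyzer name synonyms_analyzer is illustrative):

    curl -XGET 'localhost:9200/my_index/_analyze' -d '
    {
      "analyzer": "synonyms_analyzer",
      "text": "lung cancer"
    }'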

OK those numbers don't sound too scary.

Going back to this comment:

    800 GB SSD

That was "disk", singular. Does this mean you're using some form of NFS, or did you mean local disks (plural)?

We are using local disks (plural): one 800 GB SSD per node. The whole cluster is set up on Google Compute Engine. No, we aren't using any form of NFS.

ES version is 2.4.1
