Performance issue in my Elasticsearch cluster

Hi Team

I am running a simple bool/term query on an index in ES. It took 33152 ms to respond.

Request -

POST    http://elastic_server_dns/index_name/_doc/_search

Body
    {
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "orderNo": "abcdeffjfnjd"
              }
            }
          ]
        }
      }
    }

Attaching the response I got -

The first request took a whopping 33s to search a particular index.

The same request subsequently takes 2-3 ms at most. Why is this happening? Is there something wrong with my config?

A basic idea of my config:

  • 10 data nodes, each with 512 GB disk and 6 GB RAM, with 50% of the RAM allocated as heap to the Java process

  • 3 master-eligible nodes

  • Date-based indices; the above request was made on the previous day's index

  • Trying the same request on a different index takes much less time (5 ms max)

  • Disk space usage is currently around 6-7% on each data node

  • This is happening very intermittently; while typing this question, I am not able to reproduce the issue.

  • The throughput of the system using this cluster is currently about 100 requests per minute, which is very low for ES

  • Output of free -m on one of the data nodes (it is similar on all other data nodes):

                  total        used        free      shared  buff/cache   available
    Mem:           6946        3943         268           0        2734        2635
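For completeness, heap and RAM usage as reported by the cluster itself can also be pulled with the cat nodes API; a minimal sketch against the same elastic_server_dns endpoint used above (column names may vary slightly by ES version):

    GET http://elastic_server_dns/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m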
    

Please let me know what other data points are required.

Update -

I have set the timeout at my ingress level to 60s. After 60s I am getting errors like:

2019/08/27 13:26:02 [error] 39#39: *633505 upstream timed out (110: Connection timed out) 
while reading response header from upstream, client: xxx.xxx.xxx.xx, server: _, request: "POST 
/prefix-2019-08-26/_doc/_search HTTP/1.1", upstream: "http://xx.xxx.xx.x:9200/prefix-2019-08-26/_doc/_search", host: "xx.xx.xxx.x" 

which means that sometimes a request is even taking more than 60s.

How can I debug this issue? I have set the ES data nodes and master nodes to log level debug, but I am not getting any useful information there.

Update on issue

I monitored the cluster more carefully: almost all the data nodes have very high CPU usage (99%-100% on 7-8 nodes out of 10), and that is the reason for the slowness. Below is a screenshot of one of the nodes.

Can the smaller-RAM machines be the reason for this?
Each node has 6 GB RAM, of which 49% is allocated to the ES Java process, and we can see the heap is almost full on all the nodes (top output).

This is just a guess, but it sounds to me as if the filesystem cache is at work. When you start Elasticsearch and read data from the filesystem for the first time, this data is put into the filesystem cache, so that the next operation basically becomes an in-memory operation, speeding up subsequent queries - unless new data is written.

See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/_pre_loading_data_into_the_file_system_cache.html
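If it really is the filesystem cache, the page behind that link describes the index.store.preload setting. A minimal sketch, assuming a hypothetical index name (it is a static setting, so it has to be applied at index creation or on a closed index; "nvd"/"dvd" preload norms and doc values, as in the docs example):

    PUT http://elastic_server_dns/my-new-index
    {
      "settings": {
        "index.store.preload": ["nvd", "dvd"]
      }
    }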

@spinscale I have updated the issue with more findings. Can you have a look?

You can use the hot threads API to check where your CPU time is being spent.
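For reference, a minimal call looks like this (assuming the same elastic_server_dns endpoint as above; the parameters shown are the defaults that get echoed in the output header):

    GET http://elastic_server_dns/_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true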

@spinscale

Yes, I already used that. Pasting the partial output here:

    ::: {data-node}{P24qlH20RGKKSIjDYvU6qw}{tQEJ-pcVTHqgWcnTK1v5xQ}
    Hot threads at 2019-08-28T11:02:33.604, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    45.3% (226.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-node][search][T#1]'
      5/10 snapshots sharing following 20 elements
        app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:214)
        app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
        app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:670)
        app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
        app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:471)
        app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:276)
        app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
        app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:351)
        app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:398)
        app//org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
        app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:360)
        app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:356)
        app//org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
        app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
        app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
        app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        java.base@11.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        java.base@11.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        java.base@11.0.1/java.lang.Thread.run(Thread.java:834)

That looks like a regular search. How many searches are running across how many shards in parallel? Is it still only a single search across five shards?
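One rough way to see that (a sketch, using the default search thread pool name) is the cat thread pool API, which reports active and queued searches per node:

    GET http://elastic_server_dns/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected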

@spinscale
Actually, it's a Kibana query (a prefix search on a particular text field) running across indices. We have daily indices, and the query was run with a wildcard (*) index pattern, which means every index (each having 5 primary shards) receives the query.

I am seeing this huge CPU usage on almost all the data nodes.
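Roughly, the kind of query that ends up hitting every daily index looks like this (a sketch; the field name and prefix value are placeholders, and prefix-* stands for our daily index pattern):

    POST http://elastic_server_dns/prefix-*/_search
    {
      "query": {
        "prefix": {
          "someTextField": "abc"
        }
      }
    }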

That's exactly what happens when you have too much data in an index.
The first search takes forever, and the exact same search immediately afterward is very quick.
Without knowing your shard/replica setup and the document count/size per shard, it's hard to say more.
My system behaves exactly the same when an index contains too many documents spread over too few shards.
Your query will hit all the documents in the index, assuming "orderNo" is present in all the documents.
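That per-shard information can be pulled with the cat shards API; a minimal sketch, assuming prefix-* matches your daily indices:

    GET http://elastic_server_dns/_cat/shards/prefix-*?v&h=index,shard,prirep,docs,store,node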
