Performance issue in my Elasticsearch cluster

Hi Team

I am running a simple bool/term query on an index in ES. It took 33152 ms to respond.

Request -

POST    http://elastic_server_dns/index_name/_doc/_search

Body
    {
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "orderNo": "abcdeffjfnjd"
              }
            }
          ]
        }
      }
    }

Attaching the response I got -

The first request took a whopping 33s to search a particular index.

The same request subsequently takes 2-3 ms at most. Why is this happening? Is there something wrong with my config?

A basic idea of my config:

  • 10 data nodes, each with 512 GB disk and 6 GB RAM, with 50% of the RAM allocated as heap to the Java process

  • 3 master-eligible nodes

  • Date-based indices; the above request was made on the previous day's index

  • Trying the same request on a different index takes much less time (5 ms max)

  • Disk space usage is currently around 6-7% on each data node

  • This is happening very intermittently; while typing this question, I am not able to reproduce the issue.

  • The throughput of the system using this cluster is currently about 100 requests per minute, which is very low for ES

  • Output of free -m on one of the data nodes (it is similar on all other data nodes):

                  total        used        free      shared  buff/cache   available
    Mem:           6946        3943         268           0        2734        2635
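For completeness, heap and RAM usage as reported by the cluster itself can also be pulled with the cat nodes API; a minimal sketch against the same elastic_server_dns endpoint used above (column names may vary slightly by ES version):

    GET http://elastic_server_dns/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m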
    

Please let me know what other data points are required.

Update -

I have set the timeout at my ingress level to 60s. After 60s I am getting errors like:

2019/08/27 13:26:02 [error] 39#39: *633505 upstream timed out (110: Connection timed out) 
while reading response header from upstream, client: xxx.xxx.xxx.xx, server: _, request: "POST 
/prefix-2019-08-26/_doc/_search HTTP/1.1", upstream: "http://xx.xxx.xx.x:9200/prefix-2019-08-26/_doc/_search", host: "xx.xx.xxx.x" 

which means that sometimes a request is even taking more than 60s.

How can I debug this issue? I have set the ES data nodes and master nodes to log level debug, but I am not getting any useful information there.

Update on issue

I monitored the cluster more carefully: almost all the data nodes have very high CPU usage (99%-100% on 7-8 nodes out of 10), and that is the reason for the slowness. Below is a screenshot of one of the nodes.

Can the smaller-RAM machines be the reason for this?
Each node has 6 GB RAM, of which 49% is allocated to the ES Java process, and we can see the heap is almost full on all the nodes (top output).

This is just a guess, but it sounds to me as if the filesystem cache is at work. When you start Elasticsearch and read data from the filesystem for the first time, this data is put into the filesystem cache, so that the next operation basically becomes an in-memory operation, speeding up subsequent queries - unless new data is written.

See https://www.elastic.co/guide/en/elasticsearch/reference/7.3/_pre_loading_data_into_the_file_system_cache.html
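If it really is the filesystem cache, the page behind that link describes the index.store.preload setting. A minimal sketch, assuming a hypothetical index name (it is a static setting, so it has to be applied at index creation or on a closed index; "nvd"/"dvd" preload norms and doc values, as in the docs example):

    PUT http://elastic_server_dns/my-new-index
    {
      "settings": {
        "index.store.preload": ["nvd", "dvd"]
      }
    }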

@spinscale I have updated the issue with more findings. Can you have a look?

You can use the hot threads API to check where your CPU time is being spent.
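For reference, a minimal call looks like this (assuming the same elastic_server_dns endpoint as above; the parameters shown are the defaults that get echoed in the output header):

    GET http://elastic_server_dns/_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true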

@spinscale

Yes, I already used that. Pasting the partial output here:

    ::: {data-node}{P24qlH20RGKKSIjDYvU6qw}{tQEJ-pcVTHqgWcnTK1v5xQ}
    Hot threads at 2019-08-28T11:02:33.604, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    45.3% (226.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-node][search][T#1]'
      5/10 snapshots sharing following 20 elements
        app//org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:214)
        app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
        app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:670)
        app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
        app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:471)
        app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:276)
        app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
        app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:351)
        app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:398)
        app//org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
        app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:360)
        app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:356)
        app//org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
        app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
        app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
        app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        java.base@11.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        java.base@11.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        java.base@11.0.1/java.lang.Thread.run(Thread.java:834)

That looks like a regular search. How many searches are running across how many shards in parallel? Is it still only a single search across five shards?
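One rough way to see that (a sketch, using the default search thread pool name) is the cat thread pool API, which reports active and queued searches per node:

    GET http://elastic_server_dns/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected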

@spinscale
Actually, it's a Kibana query (a prefix search on a particular text field) running across indices. We have daily indices, and the query was run with a wildcard (*) index pattern, which means every index (each having 5 primary shards) receives the query.

I am seeing this huge CPU usage on almost all the data nodes.
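Roughly, the kind of query that ends up hitting every daily index looks like this (a sketch; the field name and prefix value are placeholders, and prefix-* stands for our daily index pattern):

    POST http://elastic_server_dns/prefix-*/_search
    {
      "query": {
        "prefix": {
          "someTextField": "abc"
        }
      }
    }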

That's exactly what happens when you have too much data in an index.
The first search takes forever, and the exact same search immediately afterward is very quick.
Without knowing your shard/replica setup and the document count/size per shard, it's hard to say more.
My system behaves exactly the same when an index contains too many documents spread over too few shards.
Your query will hit all the documents in the index, assuming "orderNo" is present in all the documents.
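That per-shard information can be pulled with the cat shards API; a minimal sketch, assuming prefix-* matches your daily indices:

    GET http://elastic_server_dns/_cat/shards/prefix-*?v&h=index,shard,prirep,docs,store,node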
