High CPU usage of Elasticsearch

Hi pals,

I have installed a single Elasticsearch node on a server with the following configuration:

OS: Windows Server 2016
Processors: 2 x Intel(R) Xeon CPU E5-2699 v4 @ 2.20 GHz
RAM: 32 GB (16 GB dedicated to the JVM)
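
The 16 GB heap would presumably have been set in the node's jvm.options along these lines (a minimal sketch, assuming only the values quoted above):

# jvm.options (assumed): fixed 16 GB heap, min and max set equal
-Xms16g
-Xmx16g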

I see high CPU usage (often over 70%) during the day. When I capture the hot threads, I get the following output at different times of day:

::: {ELASTIC-SRV1}{Pbra1KbPRzGKWxoIxPg0Lw}{CW6FO8xHTCaPXYUx_zuKQg}{<server address>}{<elastic address>}{ml.machine_memory=34359267328, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2019-06-23T07:41:51.708, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   37.5% (187.5ms out of 500ms) cpu usage by thread 'elasticsearch[ELASTIC-SRV1][refresh][T#1]'
     4/10 snapshots sharing following 30 elements
       org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4997)
       org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:576)
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:515)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:554)
       org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:719)
       org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:492)
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:294)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:269)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:259)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:322)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:297)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1569)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1550)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:903)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:719)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:100)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:921)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:831)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     6/10 snapshots sharing following 24 elements
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:294)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:269)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:259)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:322)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:297)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1569)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1550)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:903)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:719)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:100)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:921)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:831)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

and

25.0% (125ms out of 500ms) cpu usage by thread 'elasticsearch[ELASTIC-SRV1][write][T#1]'
     3/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
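
For context, output like the above is typically captured with the nodes hot threads API; a request matching the interval and thread count shown would look like this:

GET /_nodes/hot_threads?threads=3&interval=500ms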

My Elasticsearch node handles a high volume of write operations, from our own application logs to the logs that Filebeat and Metricbeat send from 3 different servers. Read performance is not crucial for us at this stage, but write performance is.
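
For context, the Beats on those servers presumably ship directly to this node with an output section along these lines in filebeat.yml / metricbeat.yml (host name assumed from the node name in the hot threads output):

output.elasticsearch:
  hosts: ["http://ELASTIC-SRV1:9200"]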

I have also increased the refresh_interval to 30 seconds for each index (it was previously 3 seconds), but the CPU usage has not changed significantly.
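
For reference, the per-index change was applied with a settings update of this form (the index name here is hypothetical):

PUT /filebeat-2019.06.23/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}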

Could you please tell me how I can reduce the CPU usage?

Indexing can be quite CPU intensive, so 70% CPU usage does not sound like that much to me. Is this causing any problems?

How much data are you indexing per day? What type of storage do you have? SSDs?

Well, the infrastructure team asked me to explain the reason for such high CPU usage and also the ways to reduce it.

Every day, around 5 GB of data is indexed into our Elasticsearch cluster. The server is equipped with HDDs.

Is there any way to reduce the CPU usage? Is it normal?

I know nothing about your mappings or data, nor whether you are using any ingest pipelines. Is your data processing spread out evenly over time, or do you have peaks coinciding with the high CPU usage?

For the mappings, I am using the default Filebeat and Metricbeat mappings. For the application logs, I am using the default mapping that Serilog creates.

For Filebeat, I have added a custom pipeline to ingest the log data from IIS. Here is my custom ingest pipeline:

{
  "description": "Pipeline for parsing IIS access logs. Requires the geoip and user_agent plugins.",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns":[
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:date_time} %{WORD:s_sitename} %{HOSTNAME:s_computername} %{IP:s_ip} %{WORD:cs_method} %{URIPATH:cs_uri_stem} %{NOTSPACE:cs_uri_query} %{NUMBER:s_port} %{NOTSPACE:cs_username} %{IPORHOST:c_ip} %{NOTSPACE:cs_version} %{NOTSPACE:cs_user_agent} %{NOTSPACE:cs_cookie} %{NOTSPACE:cs_referer} %{NOTSPACE:cs_host} %{NUMBER:sc_status} %{NUMBER:sc_substatus} %{NUMBER:sc_win32_status} %{NUMBER:sc_bytes} %{NUMBER:cs_bytes} %{NUMBER:time_taken} %{NOTSPACE:iis.access.X-Forwarded-For}",
		"%{TIMESTAMP_ISO8601:iis.access.time} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.referrer} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}"
]
      }
    },
    {
      "date": {
        "field": "date_time",
        "formats": [ "yyyy-MM-dd HH:mm:ss" ]
      }
    }   
  ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "{{ _index }}"
      }
    }
  ]
}
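
For completeness, a pipeline like this is typically registered and spot-checked with requests along these lines (the pipeline name and sample document are placeholders, not taken from my setup):

PUT _ingest/pipeline/my-iis-access-pipeline
{ ... pipeline body as above ... }

POST _ingest/pipeline/my-iis-access-pipeline/_simulate
{
  "docs": [
    { "_source": { "message": "<one raw IIS access log line>" } }
  ]
}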

We do not have any peaks during the day; our data ingestion is spread out evenly over time. If you need any further information to make a better assessment, please let me know.

Best,
Behzad

Grok can be quite CPU intensive, but it looks like you have quite well-formed expressions that avoid multiple DATA and GREEDYDATA patterns, which is good. I do not see any problems here.
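
To illustrate why that matters, a pattern built around greedy captures (a hypothetical example, not taken from the pipeline above) can backtrack heavily on lines that do not match, whereas anchored, field-specific patterns like yours fail fast:

Expensive on non-matching lines:  %{GREEDYDATA:prefix} %{NUMBER:status} %{GREEDYDATA:rest}
Cheaper, field-specific style:    %{TIMESTAMP_ISO8601:time} %{WORD:method} %{NUMBER:status}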

Dear Christian,

Could you please let me know which options would be suitable for addressing the high CPU usage? Should we add more CPU cores? Switch to SSDs? Reduce the volume of data? Something else?

As CPU usage only peaks at 70%, I do not understand what the problem is. Are you having performance problems?

Here are the CPU usage charts, by percentage and by MHz. As shown, the usage exceeds 100% in some cases and I do not know why. Is this usual behavior? Should we add any resources?
An important note about the usage in MHz: we have 2 CPUs, each at 2.20 GHz (4.40 GHz in total). In the second picture, the usage goes up to 6 GHz, which shows that the Elasticsearch server is using shared CPU resources from the other VMs, and that is very dangerous for us!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.