High CPU usage of Elasticsearch

Hi pals,

I have installed a single Elasticsearch node on a server with the following configuration:

OS: Windows Server 2016
Processors: 2 x Intel(R) Xeon CPU E5-2699 v4 @ 2.20 GHz
RAM: 32 GB (16 GB dedicated to the JVM)
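
The 16 GB heap would presumably have been set in the node's jvm.options along these lines (a minimal sketch, assuming only the values quoted above):

# jvm.options (assumed): fixed 16 GB heap, min and max set equal
-Xms16g
-Xmx16g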

I see high CPU usage (often over 70%) during the day. When I capture the hot threads, I get the following output at different times of day:

::: {ELASTIC-SRV1}{Pbra1KbPRzGKWxoIxPg0Lw}{CW6FO8xHTCaPXYUx_zuKQg}{<server address>}{<elastic address>}{ml.machine_memory=34359267328, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2019-06-23T07:41:51.708, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   37.5% (187.5ms out of 500ms) cpu usage by thread 'elasticsearch[ELASTIC-SRV1][refresh][T#1]'
     4/10 snapshots sharing following 30 elements
       org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4997)
       org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:576)
       org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:515)
       org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:554)
       org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:719)
       org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:492)
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:294)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:269)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:259)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:322)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:297)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1569)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1550)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:903)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:719)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:100)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:921)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:831)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     6/10 snapshots sharing following 24 elements
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:294)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:269)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:259)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:322)
       org.elasticsearch.index.engine.InternalEngine$ExternalSearcherManager.refreshIfNeeded(InternalEngine.java:297)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1569)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1550)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:903)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:719)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:100)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:921)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:831)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

and

25.0% (125ms out of 500ms) cpu usage by thread 'elasticsearch[ELASTIC-SRV1][write][T#1]'
     3/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
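
For context, output like the above is typically captured with the nodes hot threads API; a request matching the interval and thread count shown would look like this:

GET /_nodes/hot_threads?threads=3&interval=500ms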

My Elasticsearch node handles a high volume of write operations, from our own application logs to the logs that Filebeat and Metricbeat send from 3 different servers. Read performance is not crucial for us at this stage, but write performance is.
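
For context, the Beats on those servers presumably ship directly to this node with an output section along these lines in filebeat.yml / metricbeat.yml (host name assumed from the node name in the hot threads output):

output.elasticsearch:
  hosts: ["http://ELASTIC-SRV1:9200"]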

I have also increased the refresh_interval to 30 seconds for each index (it was previously 3 seconds), but the CPU usage has not changed significantly.
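
For reference, the per-index change was applied with a settings update of this form (the index name here is hypothetical):

PUT /filebeat-2019.06.23/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}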

Could you please tell me how I can reduce the CPU usage?

Indexing can be quite CPU intensive, so 70% CPU usage does not sound like that much to me. Is this causing any problems?

How much data are you indexing per day? What type of storage do you have? SSDs?

Well, the infrastructure team asked me to explain the reason for such high CPU usage and also the ways to reduce it.

Every day, around 5 GB of data is indexed into our Elasticsearch cluster. The server is equipped with HDDs.

Is there any way to reduce the CPU usage? Is it normal?

I know nothing about your mappings or data, nor whether you are using any ingest pipelines. Is your data processing spread out evenly over time, or do you have peaks coinciding with the high CPU usage?

For the mappings, I am using the default Filebeat and Metricbeat mappings. For the application logs, I am using the default mapping that Serilog creates.

For Filebeat, I have added a custom pipeline to ingest the log data from IIS. Here is my custom ingest pipeline:

{
  "description": "Pipeline for parsing IIS access logs. Requires the geoip and user_agent plugins.",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns":[
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:date_time} %{WORD:s_sitename} %{HOSTNAME:s_computername} %{IP:s_ip} %{WORD:cs_method} %{URIPATH:cs_uri_stem} %{NOTSPACE:cs_uri_query} %{NUMBER:s_port} %{NOTSPACE:cs_username} %{IPORHOST:c_ip} %{NOTSPACE:cs_version} %{NOTSPACE:cs_user_agent} %{NOTSPACE:cs_cookie} %{NOTSPACE:cs_referer} %{NOTSPACE:cs_host} %{NUMBER:sc_status} %{NUMBER:sc_substatus} %{NUMBER:sc_win32_status} %{NUMBER:sc_bytes} %{NUMBER:cs_bytes} %{NUMBER:time_taken} %{NOTSPACE:iis.access.X-Forwarded-For}",
		"%{TIMESTAMP_ISO8601:iis.access.time} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.referrer} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}",
        "%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.site_name} %{NOTSPACE:iis.access.server_name} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.agent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.hostname} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms} %{NOTSPACE:iis.access.X-Forwarded-For}"
]
      }
    },
    {
      "date": {
        "field": "date_time",
        "formats": [ "yyyy-MM-dd HH:mm:ss" ]
      }
    }   
  ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "{{ _index }}"
      }
    }
  ]
}
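
For completeness, a pipeline like this is typically registered and spot-checked with requests along these lines (the pipeline name and sample document are placeholders, not taken from my setup):

PUT _ingest/pipeline/my-iis-access-pipeline
{ ... pipeline body as above ... }

POST _ingest/pipeline/my-iis-access-pipeline/_simulate
{
  "docs": [
    { "_source": { "message": "<one raw IIS access log line>" } }
  ]
}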

We do not have any peaks during the day; our data ingestion is spread out evenly over time. If you need any further information to make a better assessment, please let me know.

Best,
Behzad

Grok can be quite CPU intensive, but it looks like you have quite well-formed expressions that avoid multiple DATA and GREEDYDATA patterns, which is good. I do not see any problems here.
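
To illustrate why that matters, a pattern built around greedy captures (a hypothetical example, not taken from the pipeline above) can backtrack heavily on lines that do not match, whereas anchored, field-specific patterns like yours fail fast:

Expensive on non-matching lines:  %{GREEDYDATA:prefix} %{NUMBER:status} %{GREEDYDATA:rest}
Cheaper, field-specific style:    %{TIMESTAMP_ISO8601:time} %{WORD:method} %{NUMBER:status}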

Dear Christian,

Could you please let me know which options would be suitable for addressing the high CPU usage? Should we add more CPU cores? Switch to SSDs? Reduce the volume of data? Something else?

As CPU usage only peaks at 70%, I do not understand what the problem is. Are you having performance problems?

Here are the CPU usage charts, by percentage and by MHz. As shown, the usage exceeds 100% in some cases and I do not know why. Is this usual behavior? Should we add any resources?
An important note about the usage in MHz: we have 2 CPUs, each at 2.20 GHz (4.40 GHz in total). In the second picture, the usage goes up to 6 GHz, which shows that the Elasticsearch server is using shared CPU resources from the other VMs, and that is very dangerous for us!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.