Disk storage not increasing despite indexes growing

Hi all,

I noticed something very weird today: our Logstash workers keep emitting data consistently to two of our ES clusters, at the expected rate.

Disk storage in one of them is consumed as expected, but in the other it stays flat (and there are rejected events in the thread pools on most of the data nodes).
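To illustrate, the per-node rejection counts can be listed with the cat thread pool API (sorted by rejections), for example:

GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc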

Problematic cluster

Good cluster

How can I find out what's wrong?

rgds,

What do your Logstash outputs look like?

They send bulk requests, like the following:
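For illustration only (index name, document ID and fields below are made up), with the update action and doc_as_upsert each event becomes a pair of lines in a _bulk request roughly like:

POST _bulk
{ "update": { "_index": "daas-arm-prod-users-2019-12", "_type": "arm", "_id": "some-ibi-id", "retry_on_conflict": 5 } }
{ "doc": { "field1": "value1", "field2": "value2" }, "doc_as_upsert": true }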

What does the configuration look like? Do you have a separate output plugin per cluster?

I have a separate output plugin for each; here are the Logstash pipeline outputs:

Problematic one

output {
  elasticsearch {
    hosts => [
        "arm-or-006.myserver.com:9996",
        "arm-or-007.myserver.com:9996",
        "arm-or-008.myserver.com:9996",
        "arm-or-010.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}

Good one

output {
  elasticsearch {
    hosts => [
        "arm-lc-001.myserver.com:9996",
        "arm-lc-003.myserver.com:9996",
        "arm-lc-004.myserver.com:9996",
        "arm-lc-005.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    ssl_certificate_verification => false
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}

Also, this is what I see in hot threads:
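For reference, the hot threads output is captured with the nodes hot threads API, for example:

GET _nodes/hot_threads?threads=5&interval=500ms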

@Christian_Dahlqvist, what does this mean?

I have never run bulk updates, so I am not sure if errors here would cause the update to be retried from Logstash or simply dropped. You seem to have a lot of time spent on management. Do you have a very large number of shards in the cluster? Are you using dynamic mappings? Do the hardware profiles supporting the clusters differ, especially with respect to the type of storage used? Is there anything in the Elasticsearch logs?

Here is all the sharding info for my cluster:
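The same overview can be reproduced with the cat APIs, for example:

GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
GET _cat/shards?v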

Here's the mapping info for the most problematic index we have in the cluster (I'm using a dynamic mapping template for it).
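For reference, the mapping and the template it comes from can be dumped with something like:

GET daas-arm-prod-users-2019-12-new/_mapping
GET _template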

We use SAN/LUN-based storage (on the order of TBs of space) and all servers have the same specs. How can I find out if it's due to bad disk I/O?
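One way to check, assuming Linux data nodes with sysstat installed, is to watch device utilisation and latency on the nodes while indexing, and compare that with the filesystem stats Elasticsearch itself reports:

# on a data node: per-device utilisation, await and throughput, refreshed every 5 seconds
iostat -xm 5

# from Elasticsearch: per-node disk usage and (on Linux) cumulative I/O operation counts
GET _nodes/stats/fs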

I didn't find anything in the ES logs, though.

I also see that my thread pools are very packed:

@Christian_Dahlqvist, how can I find out the details of those specific threads taking all the available slots in each queue?
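One option, if it helps, is the task management API, which lists the bulk tasks currently in flight on each node:

GET _tasks?detailed=true&actions=*bulk*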

And here are my cluster-wide settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "awareness": {
            "attributes": ""
          }
        }
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "60%"
        },
        "request": {
          "limit": "30%"
        }
      }
    }
  },
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "256mb"
      }
    }
  }
}
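These persistent and transient values are managed through the cluster settings API; for example, they can be inspected or adjusted with:

GET _cluster/settings

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "256mb"
  }
}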

@Christian_Dahlqvist, can you help? Let me know if any more information is needed.

Hi @Christian_Dahlqvist,

Here are my node stats. I notice one of my nodes (arm-or-009_data) is at 99% memory utilization.
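For a quick per-node view, something like this shows heap vs. OS memory (note that ram.percent includes the filesystem cache, so heap.percent is usually the more telling number):

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m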

And these are the stats for the most problematic (slow indexing) index in the cluster, "daas-arm-prod-users-2019-12-new"
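The same numbers can be pulled with the index stats API, e.g.:

GET daas-arm-prod-users-2019-12-new/_stats/indexing,merge,refresh,store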

Are there any error messages in the Elasticsearch logs? Can you try enabling the dead-letter queue to see if this captures any errors that would otherwise be ignored/dropped?
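A minimal sketch of enabling it, assuming the default data path (note the DLQ only captures events the Elasticsearch output cannot retry, i.e. 400/404 responses):

# logstash.yml
dead_letter_queue.enable: true

# a separate pipeline can then read the captured events back, e.g.:
input {
  dead_letter_queue {
    path => "/path/to/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  stdout { codec => rubydebug { metadata => true } }
}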

I checked the logs and no errors seem to be there. There's only one thing the cluster complains about a lot:

[2019-12-18T14:39:44,682][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [arm-or-002_master] collector [cluster_stats] timed out when collecting data

and

[2019-12-18T14:29:24,586][ERROR][o.e.x.m.c.i.IndexStatsCollector] [arm-or-002_master] collector [index-stats] timed out when collecting data

Whenever this is logged, there is a "blank" patch in the overview section of the cluster monitoring in Kibana (as if the cluster were unresponsive during that period).
