Disk storage not increasing despite indexes growing

Hi all,

I noticed something very weird today: our Logstash workers keep emitting data consistently to two of our ES clusters, at the expected rate.

Disk storage in one of them is consumed as expected, but in the other it stays flat (and there are rejected events in the thread pools on most of the data nodes).
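To illustrate, the per-node rejection counts can be listed with the cat thread pool API (sorted by rejections), for example:

GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc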

Problematic cluster

Good cluster

How can I find out what's wrong?

rgds,

What do your Logstash outputs look like?

They send bulk requests, like the following:
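For illustration only (index name, document ID and fields below are made up), with the update action and doc_as_upsert each event becomes a pair of lines in a _bulk request roughly like:

POST _bulk
{ "update": { "_index": "daas-arm-prod-users-2019-12", "_type": "arm", "_id": "some-ibi-id", "retry_on_conflict": 5 } }
{ "doc": { "field1": "value1", "field2": "value2" }, "doc_as_upsert": true }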

What does the configuration look like? Do you have a separate output plugin per cluster?

I have a separate output plugin for each; here are the Logstash pipeline outputs:

Problematic one

output {
  elasticsearch {
    hosts => [
        "arm-or-006.myserver.com:9996",
        "arm-or-007.myserver.com:9996",
        "arm-or-008.myserver.com:9996",
        "arm-or-010.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}

Good one

output {
  elasticsearch {
    hosts => [
        "arm-lc-001.myserver.com:9996",
        "arm-lc-003.myserver.com:9996",
        "arm-lc-004.myserver.com:9996",
        "arm-lc-005.myserver.com:9996"
    ]
    ssl => true
    cacert => "/app/ssl/cert.pem"
    ssl_certificate_verification => false
    user => "myuser"
    password => "mypass"
    document_type => "arm"
    document_id => "%{ibi_id}"
    index => "%{ibi_target}-%{+YYYY-MM}"
    doc_as_upsert => true
    action => "update"
    retry_max_interval => 5
    retry_on_conflict => 5
    flush_size => 10000
    timeout => 1000000
  }
}

Also, this is what I see in hot threads:
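For reference, the hot threads output is captured with the nodes hot threads API, for example:

GET _nodes/hot_threads?threads=5&interval=500ms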

@Christian_Dahlqvist, what does this mean?

I have never run bulk updates, so I am not sure if errors here would cause the update to be retried from Logstash or simply dropped. You seem to have a lot of time spent on management. Do you have a very large number of shards in the cluster? Are you using dynamic mappings? Do the hardware profiles supporting the clusters differ, especially with respect to the type of storage used? Is there anything in the Elasticsearch logs?

Here is all the sharding info for my cluster:
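The same overview can be reproduced with the cat APIs, for example:

GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
GET _cat/shards?v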

Here's the mapping info for the most problematic index we have in the cluster (I'm using a dynamic mapping template for it).
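For reference, the mapping and the template it comes from can be dumped with something like:

GET daas-arm-prod-users-2019-12-new/_mapping
GET _template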

We use SAN/LUN-based storage (on the order of TBs of space) and all servers have the same specs. How can I find out if it's due to bad disk I/O?
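One way to check, assuming Linux data nodes with sysstat installed, is to watch device utilisation and latency on the nodes while indexing, and compare that with the filesystem stats Elasticsearch itself reports:

# on a data node: per-device utilisation, await and throughput, refreshed every 5 seconds
iostat -xm 5

# from Elasticsearch: per-node disk usage and (on Linux) cumulative I/O operation counts
GET _nodes/stats/fs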

I didn't find anything in the ES logs, though.

I also see that my thread pools are very packed:

@Christian_Dahlqvist, how can I find out the details of those specific threads taking all the available slots in each queue?
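One option, if it helps, is the task management API, which lists the bulk tasks currently in flight on each node:

GET _tasks?detailed=true&actions=*bulk*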

And here are my cluster-wide settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "awareness": {
            "attributes": ""
          }
        }
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "60%"
        },
        "request": {
          "limit": "30%"
        }
      }
    }
  },
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "256mb"
      }
    }
  }
}
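These persistent and transient values are managed through the cluster settings API; for example, they can be inspected or adjusted with:

GET _cluster/settings

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "256mb"
  }
}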

@Christian_Dahlqvist, can you help? Let me know if any more information is needed.

Hi @Christian_Dahlqvist,

Here are my node stats. I notice one of my nodes (arm-or-009_data) is at 99% memory utilization.
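For a quick per-node view, something like this shows heap vs. OS memory (note that ram.percent includes the filesystem cache, so heap.percent is usually the more telling number):

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m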

And these are the stats for the most problematic (slow indexing) index in the cluster, "daas-arm-prod-users-2019-12-new"
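The same numbers can be pulled with the index stats API, e.g.:

GET daas-arm-prod-users-2019-12-new/_stats/indexing,merge,refresh,store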

Are there any error messages in the Elasticsearch logs? Can you try enabling the dead-letter queue to see if this captures any errors that would otherwise be ignored/dropped?
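A minimal sketch of enabling it, assuming the default data path (note the DLQ only captures events the Elasticsearch output cannot retry, i.e. 400/404 responses):

# logstash.yml
dead_letter_queue.enable: true

# a separate pipeline can then read the captured events back, e.g.:
input {
  dead_letter_queue {
    path => "/path/to/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  stdout { codec => rubydebug { metadata => true } }
}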

I checked the logs and no errors seem to be there. There's only one thing the cluster complains about a lot:

[2019-12-18T14:39:44,682][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [arm-or-002_master] collector [cluster_stats] timed out when collecting data

and

[2019-12-18T14:29:24,586][ERROR][o.e.x.m.c.i.IndexStatsCollector] [arm-or-002_master] collector [index-stats] timed out when collecting data

Whenever this is logged, there is a "blank" patch in the overview section of the cluster monitoring in Kibana (as if the cluster were unresponsive during that period).
