Kibana says "Error: Request Timeout after 30000ms" after we started to retain longer history in ES

I'll provide details below, but straight to the point: I suspect that the increased index count made it impossible for some of Kibana's built-in queries to complete in time.

$ ls
elasticsearch-5.6.14.deb  kibana-5.6.14-amd64.deb

I suspect that it is simply impossible to query 144 GB of indices on HDD with a 4 GB ES heap + ~3 GB of RAM left for mmap:

GET /_nodes/stats/jvm

  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  ...
  "nodes": {
    "Tko1n5etQrmMvEm5viBBew": {
      "timestamp": 1549977948232,
      "name": "prod-es-1",
      "transport_address": "10.132.0.39:9300",
      "host": "prod-es-kibana",
      "ip": "10.132.0.39:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "jvm": {
        "timestamp": 1549977948232,
        "uptime_in_millis": 4258976721,
        "mem": {
          "heap_used_in_bytes": 1375522832,
          "heap_used_percent": 32,
          "heap_committed_in_bytes": 4277534720,
          "heap_max_in_bytes": 4277534720,
          "non_heap_used_in_bytes": 133982776,
          "non_heap_committed_in_bytes": 141037568,
          "pools": {
            "young": {
              "used_in_bytes": 63851576,
              "max_in_bytes": 139591680,
              "peak_used_in_bytes": 139591680,
              "peak_max_in_bytes": 139591680
            },
            "survivor": {
              "used_in_bytes": 1917432,
              "max_in_bytes": 17432576,
              "peak_used_in_bytes": 17432576,
              "peak_max_in_bytes": 17432576
            },
            "old": {
              "used_in_bytes": 1309753824,
              "max_in_bytes": 4120510464,
              "peak_used_in_bytes": 3109351920,
              "peak_max_in_bytes": 4120510464
            }
          }
        },
        "threads": {
          "count": 39,
          "peak_count": 41
        },
        "gc": {
          "collectors": {
            "young": {
              "collection_count": 200423,
              "collection_time_in_millis": 6195361
            },
            "old": {
              "collection_count": 153,
              "collection_time_in_millis": 16002
            }
          }
        },
        "buffer_pools": {
          "direct": {
            "count": 29,
            "used_in_bytes": 67482875,
            "total_capacity_in_bytes": 67482874
          },
          "mapped": {
            "count": 3098,
            "used_in_bytes": 145535101052,
            "total_capacity_in_bytes": 145535101052
          }
        },
        "classes": {
          "current_loaded_count": 11930,
          "total_loaded_count": 11961,
          "total_unloaded_count": 31
        }
      }

With 7.5 GB of total RAM (a Google GCE n1-standard-2 node) I set -Xmx4g, but the indices take 145 GB (51 days of history; previously we kept 20 days).

The ES host is backed by HDD, not SSD.

It looks like some queries that sort by time run across all indices and cause this in Kibana:

Error: Request Timeout after 30000ms
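
As a stopgap I could raise Kibana's own timeout so the request at least has a chance to finish; a minimal sketch, assuming the stock .deb layout (the setting is standard, the value is just a guess):

# /etc/kibana/kibana.yml
# give slow Elasticsearch responses more time before Kibana gives up (default is 30000 ms)
elasticsearch.requestTimeout: 120000

That only hides the underlying slowness, of course.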

I applied mostly all of the ES "best practices"; I'll show that below.

The only questionable thing: I decided to go with a single node, because logback-elasticsearch-appender currently connects to a single node only (it has no round-robin fallback to other hosts from the list). No one will get fired if we lose logs ))

One host, daily indices, one shard per index (hence the yellow status xD):

"settings": {
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "1",
    "refresh_interval": "20s"
  }
}
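
Since there is only one node, the replica can never be allocated, which is what keeps everything yellow. A sketch of dropping replicas (assuming all indices match sa-*) to at least get back to green; it changes nothing about the timeouts, though:

PUT /sa-*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}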

Typical index (53 so far, planning to stop at 60):

GET /_cat/indices?v&s=index

health status index         uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   sa-2019-02-10 WzW_8x4zQBGAzsXA_11Y7Q   1   1   20440568            0      2.8gb          2.8gb
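
Retention is manual for now; with daily indices, capping history is just a matter of dropping the oldest index once we hit the limit, e.g. (the date below is made up):

# drop the oldest daily index (hypothetical example)
DELETE /sa-2018-12-21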

No problems with ingestion; the 2 CPUs stay below 5%.

$ free
              total        used        free      shared  buff/cache   available
Mem:        7663544     5108168      164424       80228     2390952     2218252
Swap:             0           0           0

$ sudo sysctl fs.file-max
fs.file-max = 761999

I can't explain the low disk reads during the "breaking" request (is it a KVM glitch? or whatever Google uses...):

$ sudo iostat
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              13.27        48.31       327.88  205978335 1397860156
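
One note on that output: plain iostat prints averages since boot, so a single slow query barely moves the numbers. Running it with an interval while the "breaking" request is in flight should show whether the disk is actually the bottleneck (sketch; watch %util and await on sda):

$ sudo iostat -x 1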

ES + Kibana are used for storing logs from Java apps and investigating incidents.

Typical investigation steps:

  • Under Discover we look for suspicious events, like _exists_:exception
  • We open View surrounding documents on the event
  • We apply filters to narrow the surrounding events down to a single execution unit (host + pid + thread) and look around: what happened before the exception (to gather evidence of what may have led to it) and after (to understand the incident's impact on app health and define severity); see the sketch right after this list
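
Stripped of Kibana plumbing, that third step boils down to a time-sorted search with a few term filters over the whole sa-* pattern, roughly like this (field names host, pid, thread come from my mapping; @timestamp and the filter values are made-up examples):

# hand-written approximation of the filtered "surrounding documents" search
GET /sa-*/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "term": { "host": "prod-app-3" } },
        { "term": { "pid": 12345 } },
        { "term": { "thread": "batch-worker-8" } }
      ]
    }
  },
  "sort": [
    { "@timestamp": "desc" }
  ]
}

Sorted by time across 50+ daily indices on HDD, that is a lot of segments to touch within 30 seconds.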

When View surrounding documents fails with Error: Request Timeout after 30000ms in Kibana, I see:

$ journalctl -u kibana.service
"res":{"statusCode":504,"responseTime":30009,"contentLength":9},"message":"POST /elasticsearch/_msearch 504 30009ms - 9.0B"}

ES logs are empty ))
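
Since ES itself logs nothing, the only idea I have for seeing where the time goes is to replay a similar query by hand with the Profile API, starting against a single daily index (sketch; the filter value is made up):

# profile a hand-written version of the query against one daily index
GET /sa-2019-02-10/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "thread": "batch-worker-8" } }
      ]
    }
  }
}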

Of course the filters have low cardinality (there are only 20 different hosts, thousands of PIDs, tens of threads).

In the mapping I marked those fields as keyword:

      "thread": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },

so a thread name like batch-worker-8 isn't analyzed into 3 tokens but kept as a single term. Besides reducing index size and storage requirements, that also makes aggregation queries possible ))
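
For example, a terms aggregation over thread only works because it is a keyword (sketch; sa-* pattern assumed):

# count events per thread name - only possible on a keyword field
GET /sa-*/_search
{
  "size": 0,
  "aggs": {
    "per_thread": {
      "terms": { "field": "thread", "size": 20 }
    }
  }
}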

I'm thinking about disabling _source. We don't reindex old indices, we just drop them )) I've never profiled the queries, as I only run them via Kibana.

It should shrink the indices. I'd like to get to 2 months of history.

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

GET /_template/sa

"mappings": {
  "sa": {
    "_all": {
      "enabled": false
    },
    "_source": {
      "enabled": true
    },
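
If I go that way, the change would land in the same template, roughly like this (a sketch that keeps only the relevant parts; the sa-* pattern is an assumption, and without _source the original JSON of a document can no longer be retrieved, which is what Discover displays, so it is a trade-off):

# sketch: same template with _source switched off
PUT /_template/sa
{
  "template": "sa-*",
  "mappings": {
    "sa": {
      "_all":    { "enabled": false },
      "_source": { "enabled": false }
    }
  }
}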

I am going to decrease the ES heap to 2.5 GB to leave more of the 7.5 GB for the page cache and index mmap.
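
The heap change itself is just the two Xms/Xmx lines in jvm.options (sketch, assuming the stock .deb layout):

# /etc/elasticsearch/jvm.options - keep min and max equal, leave the rest of RAM to the page cache
-Xms2560m
-Xmx2560m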

Any ideas on where to dig deeper?

As a workaround, our QA engineer (just the one, we are small) and I use Kibana's Discover pane to investigate the flow around an issue.

It works there because there is a time restriction that, I believe, limits the search to a single index.

But... in order to get to the right point we have to adjust the time range manually and do a lot of scrolling just to load the base event (the Discover page shows a limited number of events and loads the rest on demand as you scroll).

View surrounding documents gives the surrounding context directly (although we still have to apply the host/pid/thread filters).

EXTRA NOTE: View surrounding documents works when there are no filters in effect.

When you apply a low-cardinality filter it breaks. The lower the cardinality, the higher the probability of failure (presumably because a low-cardinality filter matches far more documents, so the time-sorted search has much more work to do).

Anticipating the suggestion to split the indices across a multi-node cluster: why should I do that?

Is it because the indices can't all be mapped into RAM? Shouldn't I just increase RAM instead?

Please, any suggestions?
