Details below, straight to the point: I suspect that the increased index count has made it impossible to perform some of Kibana's built-in queries.
$ ls
elasticsearch-5.6.14.deb kibana-5.6.14-amd64.deb
I suspect that it is simply not feasible to query 145GB of indices on an HDD with a 4GB ES heap + ~3GB of RAM left for mmapped index data:
GET /_nodes/stats/jvm
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
...
"nodes": {
"Tko1n5etQrmMvEm5viBBew": {
"timestamp": 1549977948232,
"name": "prod-es-1",
"transport_address": "10.132.0.39:9300",
"host": "prod-es-kibana",
"ip": "10.132.0.39:9300",
"roles": [
"master",
"data",
"ingest"
],
"jvm": {
"timestamp": 1549977948232,
"uptime_in_millis": 4258976721,
"mem": {
"heap_used_in_bytes": 1375522832,
"heap_used_percent": 32,
"heap_committed_in_bytes": 4277534720,
"heap_max_in_bytes": 4277534720,
"non_heap_used_in_bytes": 133982776,
"non_heap_committed_in_bytes": 141037568,
"pools": {
"young": {
"used_in_bytes": 63851576,
"max_in_bytes": 139591680,
"peak_used_in_bytes": 139591680,
"peak_max_in_bytes": 139591680
},
"survivor": {
"used_in_bytes": 1917432,
"max_in_bytes": 17432576,
"peak_used_in_bytes": 17432576,
"peak_max_in_bytes": 17432576
},
"old": {
"used_in_bytes": 1309753824,
"max_in_bytes": 4120510464,
"peak_used_in_bytes": 3109351920,
"peak_max_in_bytes": 4120510464
}
}
},
"threads": {
"count": 39,
"peak_count": 41
},
"gc": {
"collectors": {
"young": {
"collection_count": 200423,
"collection_time_in_millis": 6195361
},
"old": {
"collection_count": 153,
"collection_time_in_millis": 16002
}
}
},
"buffer_pools": {
"direct": {
"count": 29,
"used_in_bytes": 67482875,
"total_capacity_in_bytes": 67482874
},
"mapped": {
"count": 3098,
"used_in_bytes": 145535101052,
"total_capacity_in_bytes": 145535101052
}
},
"classes": {
"current_loaded_count": 11930,
"total_loaded_count": 11961,
"total_unloaded_count": 31
}
}
With 7.5 GB of total RAM (a Google GCE n1-standard-2 node) I set -Xmx4g,
but the indices take 145GB (51 days of history; previously we kept only 20 days).
The ES host is backed by an HDD, not an SSD.
It looks like some queries sorted by time are executed across all indices and cause this in Kibana:
Error: Request Timeout after 30000ms
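For reference, a time-sorted search across all daily indices can be reproduced from Kibana Dev Tools with something like the sketch below (assuming the timestamp field is @timestamp; adjust to the real field name) and timed via the took value in the response:
GET /sa-*/_search
{
  "size": 500,
  "sort": [ { "@timestamp": "desc" } ],
  "query": { "exists": { "field": "exception" } }
}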
I have applied most of the ES "best practices"; I'll show that below.
The only questionable decision: I went with a single node because logback-elasticsearch-appender
currently connects to a single node only (it has no round-robin fallback to the other hosts in the list). Nobody will get fired if we lose logs ))
One host, one index per day, one shard per index (hence the yellow status xD):
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "1",
"refresh_interval": "20s"
}
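The yellow status is just the replica shards that can never be allocated on a single node; for reference, they could be dropped with something like this (a sketch, I haven't actually run it):
PUT /sa-*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}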
A typical index (53 so far; planning to stop at around 60):
GET /_cat/indices?v&s=index
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open sa-2019-02-10 WzW_8x4zQBGAzsXA_11Y7Q 1 1 20440568 0 2.8gb 2.8gb
No problems with ingestion; the 2 CPUs stay below 5%.
$ free
total used free shared buff/cache available
Mem: 7663544 5108168 164424 80228 2390952 2218252
Swap: 0 0 0
$ sudo sysctl fs.file-max
fs.file-max = 761999
I can't explain the low disk reads during the "breaking" request (is it a KVM glitch? or whatever Google uses...):
$ sudo iostat
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 13.27 48.31 327.88 205978335 1397860156
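That iostat output is the average since boot, though; to see what actually happens on disk while the breaking request runs, I'd probably need a per-interval view, something like:
$ iostat -x 1 30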
ES + Kibana are used for storing logs from Java apps and investigating incidents.
Typical investigation steps:
- Under Discover we look for suspicious events, like
_exists_:exception
- We open View surrounding documents on the event
- We apply filters to narrow the surrounding events down to a single execution unit, i.e.
host
+ pid
+ thread
and then look at what happened before the exception (to gather evidence of what may have led to it) and after (to understand the incident's impact on app health and define its severity); a sketch of the resulting query is below
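Roughly, that filtering boils down to a bool query with term filters plus a sort by time, something like the sketch below (the field values and the @timestamp field name here are made-up examples; the query Kibana actually generates also restricts the time range around the anchor document):
GET /sa-*/_search
{
  "size": 100,
  "sort": [ { "@timestamp": "desc" } ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "host": "prod-app-1" } },
        { "term": { "pid": "12345" } },
        { "term": { "thread": "batch-worker-8" } }
      ]
    }
  }
}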
With Error: Request Timeout after 30000ms
in View surrounding documents in Kibana, I see:
$ journalctl -u kibana.service
"res":{"statusCode":504,"responseTime":30009,"contentLength":9},"message":"POST /elasticsearch/_msearch 504 30009ms - 9.0B"}
ES logs are empty ))
Of course the filters have low cardinality (there are only 20 different hosts, thousands of PIDs, tens of threads).
In the mapping I marked those fields as keyword:
"thread": {
"type": "keyword"
},
"host": {
"type": "keyword"
},
so a thread name like batch-worker-8
isn't analyzed as a 3-word unit but kept as a single term. Besides reducing index size and storage requirements, that also makes aggregation queries possible ))
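For example, the keyword mapping makes a terms aggregation over threads possible, something like this sketch:
GET /sa-*/_search
{
  "size": 0,
  "aggs": {
    "events_per_thread": {
      "terms": { "field": "thread", "size": 20 }
    }
  }
}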
I'm thinking about disabling _source. We never reindex, we just drop old indices )) I've never profiled the queries because I only run them through Kibana.
Disabling _source should shrink the indices; I'd like to get to 2 months of history (see the sketch after the template excerpt below).
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
GET /_template/sa
"mappings": {
"sa": {
"_all": {
"enabled": false
},
"_source": {
"enabled": true
},
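With _source disabled, that part of the template would become the excerpt below (same excerpt, just the flag flipped; I haven't verified how Kibana Discover behaves without _source):
"mappings": {
  "sa": {
    "_all": {
      "enabled": false
    },
    "_source": {
      "enabled": false
    },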
I am going to decrease the ES heap to 2.5GB to leave more of the 7.5GB for mmapped indices / the filesystem cache.
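That would be a change in /etc/elasticsearch/jvm.options (assuming the default layout of the deb package), e.g.:
# /etc/elasticsearch/jvm.options (excerpt)
-Xms2560m
-Xmx2560m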
Any ideas on where to dig deeper?