Hi everyone,
I could not find any existing post describing the problem I am encountering, so here it is:
For a while now I have been unable to use the Discover tab in Kibana; when I try to, it basically crashes the Elasticsearch coordinating node while loading the page.
Setup:
- 2 reverse proxies (HAProxy), active/passive using VRRP, acting as entry point ( 4 CPU / 12 GB RAM )
- 3 master nodes ( 4 CPU / 12 GB RAM )
- 3 coordinating nodes running Kibana ( 4 CPU / 12 GB RAM )
- 18 hot data nodes ( 16 CPU / 48 GB RAM )
- 27 warm data nodes ( 8 CPU / 48 GB RAM )
- 12 cold data nodes ( 8 CPU / 48 GB RAM )
On all Elasticsearch nodes the JVM heap size is set to half of the available RAM.
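For the 12 GB coordinating nodes, for example, that translates to something like the two lines below (here via a file under /etc/elasticsearch/jvm.options.d/, but any of the usual mechanisms works; the data nodes are scaled the same way to 24 GB):
-Xms6g
-Xmx6g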
Cluster information:
{
"name": "log-esc-1",
"cluster_name": "es-cluster-1",
"cluster_uuid": "XKwqFkITZTWa23-PoL-ASMW",
"version": {
"number": "7.13.4",
"build_flavor": "default",
"build_type": "deb",
"build_hash": "c5f60e894ca0c61cdbae4f5a686d9f08bcefc942",
"build_date": "2021-07-14T18:33:36.673943207Z",
"build_snapshot": false,
"lucene_version": "8.8.2",
"minimum_wire_compatibility_version": "6.8.0",
"minimum_index_compatibility_version": "6.0.0-beta1"
},
"tagline": "You Know, for Search"
}
So when opening the "Discover" tab in Kibana, it loads for at least a minute and then crashes the node.
The node dumps its heap, and I then have to restart the Elasticsearch service to recover:
Dec 03 15:12:19 log-elasticsearch-1-1-coordinating-1 systemd-entrypoint[20852]: java.lang.OutOfMemoryError: Java heap space
Dec 03 15:12:19 log-elasticsearch-1-1-coordinating-1 systemd-entrypoint[20852]: Dumping heap to /var/lib/elasticsearch/java_pid20852.hprof ...
Dec 03 15:13:37 log-elasticsearch-1-1-coordinating-1 systemd-entrypoint[20852]: Heap dump file created [9029375683 bytes in 78.111 secs]
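The recovery itself is just restarting the service on the affected node, which with the deb package is presumably something like:
# systemctl restart elasticsearch.service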
Looking at the Kibana logs, we see these errors while trying to load the page:
{
"type": "log",
"@timestamp": "2021-11-26T08:54:54+01:00",
"tags.0": "error",
"tags.1": "plugins",
"tags.2": "taskManager",
"pid": 16966,
"message": "Failed to poll for work: Error: work has timed out"
}
From the browser's developer console (Network tab), I can see that the following request takes forever:
/api/index_patterns/_fields_for_wildcard?pattern=mail-logs-*&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score
I am able to reproduce the problem simply by calling that URL directly:
# curl -vv -XGET -H "Content-Type: application/json" -H "kbn-xsrf: true" "http://log-esc-1:5601/api/index_patterns/_fields_for_wildcard?pattern=mail-logs-*&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score"
* Trying 192.168.1.151...
* TCP_NODELAY set
* Connected to log-esc-1 (192.168.1.151) port 5601 (#0)
> GET /api/index_patterns/_fields_for_wildcard?pattern=logs-*&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score HTTP/1.1
> Host: log-esc-1:5601
> User-Agent: curl/7.52.1
> Accept: */*
> Content-Type: application/json
> kbn-xsrf: true
>
* Curl_http_done: called premature == 0
* Empty reply from server
* Connection #0 to host log-esc-1 left intact
curl: (52) Empty reply from server
In the Elasticsearch logs:
[2021-12-03T15:13:52,743][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [log-esc-1] fatal error in thread [elasticsearch[log-esc-1][generic][T#16]], exiting
java.lang.OutOfMemoryError: Java heap space
at java.util.stream.Collectors.toSet(Collectors.java:327) ~[?:?]
at org.elasticsearch.cluster.block.ClusterBlocks.generateLevelHolders(ClusterBlocks.java:87) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.block.ClusterBlocks.<init>(ClusterBlocks.java:50) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.block.ClusterBlocks$Builder.build(ClusterBlocks.java:434) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.coordination.Coordinator.clusterStateWithNoMasterBlock(Coordinator.java:1057) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.coordination.Coordinator.getStateForMasterService(Coordinator.java:1045) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.coordination.Coordinator.getClusterFormationState(Coordinator.java:204) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.cluster.coordination.Coordinator$$Lambda$4117/0x0000000801650f60.get(Unknown Source) ~[?:?]
at org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper$WarningScheduler$1.doRun(ClusterFormationFailureHelper.java:91) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.13.4.jar:7.13.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.13.4.jar:7.13.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Trying to load the fields of my index pattern (from "Management" -> "Stack Management" -> "Index patterns") fails as well.
That seems logical, since it calls the same endpoint ( /api/index_patterns/_fields_for_wildcard?pattern=logs-*&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score ).
I use a static mapping to avoid a "mapping explosion" and to keep a reasonable number of fields; at least, that is what I thought until I queried:
GET logs/_mapping
It takes about 10 seconds to return a "small" payload of 7'907'833 lines (OK, unflattened JSON, but still...):
{
"logs-004155" : {
"mappings" : {
"dynamic_templates" : [
{
"message_field" : {
"path_match" : "message",
"match_mapping_type" : "string",
"mapping" : {
"norms" : false,
"type" : "text"
}
}
},
{
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"fields" : {
"keyword" : {
"ignore_above" : 256,
"type" : "keyword"
}
},
"norms" : false,
"type" : "text"
}
}
}
],
"properties" : {
"@timestamp" : {
"type" : "date"
},
"@version" : {
"type" : "keyword"
},
"agent" : {
"properties" : {
"ephemeral_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"norms" : false
},
"hostname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"norms" : false
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"norms" : false
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"norms" : false
},
"version" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"norms" : false
}
}
},
[...]
The reason for so many lines seems to be that there is a mapping entry for every index, and I have ~4'000 indices. OK, fine, but could that be the reason ES crashes?
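A quick sanity check for that count (assuming the coordinating node also exposes the Elasticsearch HTTP API on port 9200 and that jq is available) is to count the top-level keys of the mapping response, since there is one per backing index:
# curl -s "http://log-esc-1:9200/logs/_mapping" | jq 'keys | length'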
What exactly happens when calling /api/index_patterns/_fields_for_wildcard?
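My (unverified) understanding is that Kibana more or less turns that call into a field capabilities request across every index matched by the pattern, so roughly the equivalent of:
GET logs-*/_field_caps?fields=*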
My index pattern is set to match logs-*, and each of the 4'000 indices has a logs alias pointing to the real index name; could that be a problem?
Has anyone already experienced such an issue?
I would appreciate any hints to move forward.
Thank you for your help
EDIT 1: by increasing elasticsearch.requestTimeout in /etc/kibana/kibana.yml up to 120000, I am finally able to get the response! It is a 343 KB response of 15'277 lines.
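For reference, this is the line added to /etc/kibana/kibana.yml:
elasticsearch.requestTimeout: 120000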
Of these 15'277 lines, almost 10'000 are filled with the conflictDescriptions for a given field:
[...]
{
"name": "score",
"type": "conflict",
"esTypes": [
"text",
"integer"
],
"searchable": true,
"aggregatable": true,
"readFromDocValues": false,
"conflictDescriptions": {
"text": [
"shrink--0vk-logs-000992",
"shrink--efi-logs-000978",
"shrink--m8o-logs-000786",
"shrink-04gb-logs-000895",
"shrink-0jgo-logs-000997",
"shrink-1oiv-logs-001082",
"shrink-24qh-logs-001086",
[...]
I also found a related issue on GitHub that seems to be the root cause: conflictDescriptions in index-pattern can get really large (>10MB) · Issue #17007 · elastic/kibana.
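In the meantime, to see which concrete indices map score as text versus integer without going through Kibana, the field capabilities API seems to report the per-type index list directly whenever a field's type is inconsistent (a sketch, querying through the logs alias mentioned above):
GET logs/_field_caps?fields=score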
Any idea how to get rid of those conflictDescriptions?