Hi there,
I have a small single-node Elasticsearch cluster for development purposes that crashes about once a month, even when there is no traffic/requests (e.g. in the middle of the night, when no one is working on it).
It is running Elasticsearch 7.4.2 on a dual-core instance with 2 GB RAM.
According to Kibana (also running on the same node), there are:
- 4,496,347 documents
- 77 indices
- 175 primary shards
- 2.2 GB of disk usage
The document count is mostly from the .monitoring indices; fewer than 10,000 documents are our own.
The node performs well without any issues, until it suddenly crashes.
The JVM config is:
-Xms768m
-Xmx1g
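(I realize -Xms and -Xmx differ here; as far as I understand, the Elasticsearch docs recommend setting them equal so the heap is fully locked at startup. The intended config would presumably look like this in /etc/elasticsearch/jvm.options:)

```
# jvm.options — min and max heap set equal, per the Elasticsearch heap-sizing docs
-Xms1g
-Xmx1g
```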
Here is the output of _cluster/stats?human&pretty (after the restart):
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "elasticsearch",
  "cluster_uuid": "WJOOuxXuTd2dQ0UMMhuPkg",
  "timestamp": 1634113459473,
  "status": "yellow",
  "indices": {
    "count": 77,
    "shards": {
      "total": 175,
      "primaries": 175,
      "replication": 0.0,
      "index": {
        "shards": {
          "min": 1,
          "max": 5,
          "avg": 2.272727272727273
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 2.272727272727273
        },
        "replication": {
          "min": 0.0,
          "max": 0.0,
          "avg": 0.0
        }
      }
    },
    "docs": {
      "count": 4498732,
      "deleted": 2945166
    },
    "store": {
      "size": "2.1gb",
      "size_in_bytes": 2342333611
    },
    "fielddata": {
      "memory_size": "21.6kb",
      "memory_size_in_bytes": 22128,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "1.8mb",
      "memory_size_in_bytes": 1921056,
      "total_count": 31965,
      "hit_count": 13810,
      "miss_count": 18155,
      "cache_size": 276,
      "cache_count": 330,
      "evictions": 54
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 486,
      "memory": "5.1mb",
      "memory_in_bytes": 5358682,
      "terms_memory": "2.7mb",
      "terms_memory_in_bytes": 2900928,
      "stored_fields_memory": "507.9kb",
      "stored_fields_memory_in_bytes": 520096,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "48.5kb",
      "norms_memory_in_bytes": 49664,
      "points_memory": "1mb",
      "points_memory_in_bytes": 1123210,
      "doc_values_memory": "746.8kb",
      "doc_values_memory_in_bytes": 764784,
      "index_writer_memory": "0b",
      "index_writer_memory_in_bytes": 0,
      "version_map_memory": "0b",
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set": "905.6kb",
      "fixed_bit_set_memory_in_bytes": 927416,
      "max_unsafe_auto_id_timestamp": 1634108911687,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 1,
      "coordinating_only": 0,
      "data": 1,
      "ingest": 1,
      "master": 1,
      "ml": 1,
      "voting_only": 0
    },
    "versions": [
      "7.4.2"
    ],
    "os": {
      "available_processors": 2,
      "allocated_processors": 2,
      "names": [
        {
          "name": "Linux",
          "count": 1
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Ubuntu 20.04.1 LTS",
          "count": 1
        }
      ],
      "mem": {
        "total": "1.9gb",
        "total_in_bytes": 2044534784,
        "free": "85.6mb",
        "free_in_bytes": 89849856,
        "used": "1.8gb",
        "used_in_bytes": 1954684928,
        "free_percent": 4,
        "used_percent": 96
      }
    },
    "process": {
      "cpu": {
        "percent": 8
      },
      "open_file_descriptors": {
        "min": 1353,
        "max": 1353,
        "avg": 1353
      }
    },
    "jvm": {
      "max_uptime": "1.2h",
      "max_uptime_in_millis": 4592145,
      "versions": [
        {
          "version": "13.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "13.0.1+9",
          "vm_vendor": "AdoptOpenJDK",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 1
        }
      ],
      "mem": {
        "heap_used": "535mb",
        "heap_used_in_bytes": 561009184,
        "heap_max": "1007.3mb",
        "heap_max_in_bytes": 1056309248
      },
      "threads": 48
    },
    "fs": {
      "total": "67.7gb",
      "total_in_bytes": 72794869760,
      "free": "58.7gb",
      "free_in_bytes": 63081897984,
      "available": "58.7gb",
      "available_in_bytes": 63065120768
    },
    "plugins": [
      {
        "name": "repository-s3",
        "version": "7.4.2",
        "elasticsearch_version": "7.4.2",
        "java_version": "1.8",
        "description": "The S3 repository plugin adds S3 repositories",
        "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 1
      },
      "http_types": {
        "security4": 1
      }
    },
    "discovery_types": {
      "single-node": 1
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "deb",
        "count": 1
      }
    ]
  }
}
(I know I have unassigned replica shards, but that shouldn't be an issue.)
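One back-of-the-envelope check I did while investigating, using the numbers from the stats above (the commonly cited guideline, e.g. in Elastic's shard-sizing docs, is to stay well under ~20 shards per GB of heap):

```python
# Shards per GB of heap, from the _cluster/stats output above.
heap_max_in_bytes = 1056309248   # "heap_max_in_bytes" reported above (~1 GB)
shards = 175                     # primary shards; no replicas are assigned

shards_per_gb = shards / (heap_max_in_bytes / 1024**3)
print(round(shards_per_gb))      # ~178 shards per GB of heap
```

So the node is far above that guideline, though I'm not sure it explains the crashes.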
There is nothing in the logs when the crash happens:
# Note: log times are UTC
[2021-10-13T01:30:00,010][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [flus-es-dev] Deleting expired data
[2021-10-13T01:30:00,024][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [flus-es-dev] Completed deletion of expired ML data
[2021-10-13T01:30:00,024][INFO ][o.e.x.m.MlDailyMaintenanceService] [flus-es-dev] Successfully completed [ML] maintenance tasks
--- it crashed at Oct 13 03:17:23 according to the kernel OOM-killer log ---
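(For reference, this is roughly how I found the OOM-killer entry in the kernel log; the exact log location varies by distro, and on Ubuntu both of these should work:)

```shell
# Search the kernel ring buffer / journal for OOM-killer entries
journalctl -k | grep -i "out of memory"
# or, with human-readable timestamps:
dmesg -T | grep -i "killed process"
```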
-> restart
[2021-10-13T07:07:51,910][INFO ][o.e.e.NodeEnvironment ] [flus-es-dev] using [1] data paths, mounts [[/ (/dev/root)]], net usable_space [58.5gb], net total_space [67.7gb], types [ext4]
[2021-10-13T07:07:51,937][INFO ][o.e.e.NodeEnvironment ] [flus-es-dev] heap size [1007.3mb], compressed ordinary object pointers [true]
[...]
The only thing that seems strange is the gc.log: it is constantly logging "allocation failure" messages. I read elsewhere that this shouldn't be an issue (apparently an allocation failure is just the normal trigger for a young-generation GC), but I still find it odd.
--> See attached GC log: [2021-10-13T08:34:59.517+0000][468][gc,start ] GC(596) Pause Young (Allocati - Pastebin.com
And a Kibana capture (times are UTC+2, Paris):
Any clues?