Out Of Memory crash, few documents & load

Hi there,

I have a small single-node Elasticsearch cluster for development purposes that crashes about once a month, even when there is no traffic or requests (for example in the middle of the night, when no one is working on it).

It is running Elasticsearch 7.4.2 on a dual-core instance with 2 GB of RAM.
According to Kibana (also running on the same node), there are:

  • 4,496,347 documents
  • 77 indices
  • 175 primary shards
  • 2.2 GB of disk usage

Most of the documents are in the .monitoring indices; our own indices hold fewer than 10,000 documents.

The node performs well without any issues, until it suddenly crashes.

The JVM config is:

-Xms768m
-Xmx1g

Here is the output of _cluster/stats?human&pretty (after the restart):

{
    "_nodes": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "cluster_name": "elasticsearch",
    "cluster_uuid": "WJOOuxXuTd2dQ0UMMhuPkg",
    "timestamp": 1634113459473,
    "status": "yellow",
    "indices": {
        "count": 77,
        "shards": {
            "total": 175,
            "primaries": 175,
            "replication": 0.0,
            "index": {
                "shards": {
                    "min": 1,
                    "max": 5,
                    "avg": 2.272727272727273
                },
                "primaries": {
                    "min": 1,
                    "max": 5,
                    "avg": 2.272727272727273
                },
                "replication": {
                    "min": 0.0,
                    "max": 0.0,
                    "avg": 0.0
                }
            }
        },
        "docs": {
            "count": 4498732,
            "deleted": 2945166
        },
        "store": {
            "size": "2.1gb",
            "size_in_bytes": 2342333611
        },
        "fielddata": {
            "memory_size": "21.6kb",
            "memory_size_in_bytes": 22128,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "1.8mb",
            "memory_size_in_bytes": 1921056,
            "total_count": 31965,
            "hit_count": 13810,
            "miss_count": 18155,
            "cache_size": 276,
            "cache_count": 330,
            "evictions": 54
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 486,
            "memory": "5.1mb",
            "memory_in_bytes": 5358682,
            "terms_memory": "2.7mb",
            "terms_memory_in_bytes": 2900928,
            "stored_fields_memory": "507.9kb",
            "stored_fields_memory_in_bytes": 520096,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "48.5kb",
            "norms_memory_in_bytes": 49664,
            "points_memory": "1mb",
            "points_memory_in_bytes": 1123210,
            "doc_values_memory": "746.8kb",
            "doc_values_memory_in_bytes": 764784,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "905.6kb",
            "fixed_bit_set_memory_in_bytes": 927416,
            "max_unsafe_auto_id_timestamp": 1634108911687,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 1,
            "coordinating_only": 0,
            "data": 1,
            "ingest": 1,
            "master": 1,
            "ml": 1,
            "voting_only": 0
        },
        "versions": [
            "7.4.2"
        ],
        "os": {
            "available_processors": 2,
            "allocated_processors": 2,
            "names": [
                {
                    "name": "Linux",
                    "count": 1
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 20.04.1 LTS",
                    "count": 1
                }
            ],
            "mem": {
                "total": "1.9gb",
                "total_in_bytes": 2044534784,
                "free": "85.6mb",
                "free_in_bytes": 89849856,
                "used": "1.8gb",
                "used_in_bytes": 1954684928,
                "free_percent": 4,
                "used_percent": 96
            }
        },
        "process": {
            "cpu": {
                "percent": 8
            },
            "open_file_descriptors": {
                "min": 1353,
                "max": 1353,
                "avg": 1353
            }
        },
        "jvm": {
            "max_uptime": "1.2h",
            "max_uptime_in_millis": 4592145,
            "versions": [
                {
                    "version": "13.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "13.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 1
                }
            ],
            "mem": {
                "heap_used": "535mb",
                "heap_used_in_bytes": 561009184,
                "heap_max": "1007.3mb",
                "heap_max_in_bytes": 1056309248
            },
            "threads": 48
        },
        "fs": {
            "total": "67.7gb",
            "total_in_bytes": 72794869760,
            "free": "58.7gb",
            "free_in_bytes": 63081897984,
            "available": "58.7gb",
            "available_in_bytes": 63065120768
        },
        "plugins": [
            {
                "name": "repository-s3",
                "version": "7.4.2",
                "elasticsearch_version": "7.4.2",
                "java_version": "1.8",
                "description": "The S3 repository plugin adds S3 repositories",
                "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false
            }
        ],
        "network_types": {
            "transport_types": {
                "security4": 1
            },
            "http_types": {
                "security4": 1
            }
        },
        "discovery_types": {
            "single-node": 1
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "deb",
                "count": 1
            }
        ]
    }
}

(I know I have replica shards that are unassigned, but that shouldn't be an issue.)
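For context, here is a rough sketch of how much physical memory remains once the configured heap is subtracted, using only values copied from the stats output above. The function name is illustrative, and the figure ignores JVM off-heap usage, so the real headroom is presumably smaller.

```python
def non_heap_headroom_bytes(os_total_bytes, heap_max_bytes):
    """Physical memory left for everything except the JVM heap."""
    return os_total_bytes - heap_max_bytes


# Values copied from the _cluster/stats output above.
OS_TOTAL = 2044534784  # nodes.os.mem.total_in_bytes
HEAP_MAX = 1056309248  # nodes.jvm.mem.heap_max_in_bytes

headroom = non_heap_headroom_bytes(OS_TOTAL, HEAP_MAX)
heap_share = HEAP_MAX / OS_TOTAL

print(f"heap share of RAM: {heap_share:.1%}")
print(f"left for OS, page cache, JVM off-heap and Kibana: {headroom / 2**20:.0f} MiB")
```

The remainder has to cover the operating system, the filesystem cache, Elasticsearch's off-heap memory and the Kibana process on the same node.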

There is nothing in the logs when the crash happens:

#Note: log time is UTC
[2021-10-13T01:30:00,010][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [flus-es-dev] Deleting expired data
[2021-10-13T01:30:00,024][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [flus-es-dev] Completed deletion of expired ML data
[2021-10-13T01:30:00,024][INFO ][o.e.x.m.MlDailyMaintenanceService] [flus-es-dev] Successfully completed [ML] maintenance tasks

--- it crashed at Oct 13 03:17:23 according to the kernel log OOM ---
-> restart
[2021-10-13T07:07:51,910][INFO ][o.e.e.NodeEnvironment    ] [flus-es-dev] using [1] data paths, mounts [[/ (/dev/root)]], net usable_space [58.5gb], net total_space [67.7gb], types [ext4]
[2021-10-13T07:07:51,937][INFO ][o.e.e.NodeEnvironment    ] [flus-es-dev] heap size [1007.3mb], compressed ordinary object pointers [true]
[...]

The only thing that seems strange is the gc.log: it constantly shows "Allocation Failure" messages. I read elsewhere that this shouldn't be an issue, but I find it strange.

--> See attached GC log : [2021-10-13T08:34:59.517+0000][468][gc,start ] GC(596) Pause Young (Allocati - Pastebin.com

And a Kibana capture (times are UTC+2, Paris):



Any clues?

As a rule of thumb, 1 GB of heap is generally good for up to 20 shards.
Here, you would need something like 9 GB of heap for around 180 shards.
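The arithmetic behind that estimate can be sketched as follows. This is only the rule of thumb above turned into code; the function name and the rounding up to whole gigabytes are assumptions made for illustration, not an official sizing formula.

```python
import math

SHARDS_PER_GB_HEAP = 20  # rule of thumb quoted above, not a hard limit


def heap_gb_for_shards(shard_count, shards_per_gb=SHARDS_PER_GB_HEAP):
    """Heap in whole GB suggested by the shards-per-GB rule of thumb."""
    return math.ceil(shard_count / shards_per_gb)


print(heap_gb_for_shards(175))  # shard count reported by _cluster/stats
print(heap_gb_for_shards(180))  # the rounded figure used above
```

Both calls give 9, which is where the 9 GB figure comes from.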

-Xms768m
-Xmx1g

You should use the same value for both settings. I'm surprised that you can even start Elasticsearch, but maybe it's only a WARN as you are not in production. You should see this check happening: Heap size check | Elasticsearch Guide [7.15] | Elastic
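A minimal sketch of that check, written as a standalone Python helper rather than anything Elasticsearch itself ships. The function names are hypothetical, and only the plain -Xms/-Xmx forms with an optional k/m/g suffix are handled.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_flag(flag):
    """Parse a flag such as '-Xms768m' or '-Xmx1g' into (name, bytes)."""
    match = re.fullmatch(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", flag.strip())
    if match is None:
        raise ValueError(f"not a heap flag: {flag!r}")
    name, number, unit = match.groups()
    return name, int(number) * _UNITS.get(unit.lower(), 1)


def heap_sizes_match(flags):
    """True when the initial (Xms) and maximum (Xmx) heap sizes are equal."""
    sizes = dict(parse_heap_flag(flag) for flag in flags)
    return sizes["Xms"] == sizes["Xmx"]


print(heap_sizes_match(["-Xms768m", "-Xmx1g"]))  # the config from the question
print(heap_sizes_match(["-Xms1g", "-Xmx1g"]))    # both set to the same value
```

The first call reports a mismatch and the second does not, so setting both flags to 1g would satisfy the check.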

Advice:

  • Increase the heap size and, if possible, the available RAM.
  • Or remove the indices you don't need (the monitoring ones...).
  • Upgrade to 7.15.0. A lot has changed in the meantime.

Hello @dadoonet,

Thanks for the clues. I will change the Xms value, but that is indeed not the source of the issue.

If I understand correctly, I should allocate at least 9 GB of JVM heap (about 1 GB per 20 shards), so I would need 18 GB of memory, since 50% goes to the JVM and 50% to the system?
That's a lot of memory.
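Spelled out as a small sketch, assuming the 50/50 split between heap and system memory and the 20-shards-per-GB guideline discussed in this thread (the function names are illustrative):

```python
def ram_gb_for_heap(heap_gb, heap_fraction=0.5):
    """Total RAM needed when the heap may only use `heap_fraction` of it."""
    return heap_gb / heap_fraction


def shards_for_ram(ram_gb, heap_fraction=0.5, shards_per_gb_heap=20):
    """Shard count the same guidelines suggest for a given amount of RAM."""
    return int(ram_gb * heap_fraction * shards_per_gb_heap)


print(ram_gb_for_heap(9))  # RAM for a 9 GB heap
print(shards_for_ram(2))   # shards suggested for the current 2 GB instance
```

Read the other way around, the same guidelines suggest roughly 20 shards for a 2 GB instance, against the 175 currently allocated.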

What could be causing the OOM error when there is no traffic and no snapshot running?

Would the upgrade to 7.15.0 change anything regarding memory that would eliminate this issue?

I have to give my manager a solid and concise answer. I understand from Size your shards | Elasticsearch Guide [7.15] | Elastic that we are oversharding, but my manager's point is: "It has been working fine for a month, so what is causing the sudden crash? Can it be fixed without increasing the server size?"

Thanks for your answers :slight_smile:

Bonjour Cyril :slight_smile:

Well, the number of indices increased, so yes: even if you don't query them, they take up a significant part of the memory, specifically in the cluster state.
If you don't want to delete the indices you don't need, then you should close the ones you are not using.

can it be fixed without increasing server size?

Yes. Most likely it will be fine again if you remove the indices you don't need.
You should also disable monitoring, or index the monitoring data in a separate cluster rather than in the production cluster itself.
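As a hedged sketch, the requests involved could be built like this. The helpers only assemble the method, path and body and send nothing. The index name is hypothetical, and the `xpack.monitoring.collection.enabled` cluster setting should be checked against the documentation for your version.

```python
import json


def close_index_request(index):
    """Request that closes an index (close index API)."""
    return ("POST", f"/{index}/_close", None)


def disable_monitoring_collection_request():
    """Request that turns off monitoring data collection cluster-wide."""
    body = {"persistent": {"xpack.monitoring.collection.enabled": False}}
    return ("PUT", "/_cluster/settings", body)


def render(request):
    """Format a request the way it would be typed into the Kibana console."""
    method, path, body = request
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


print(render(close_index_request("customer-42")))  # hypothetical index name
print(render(disable_monitoring_collection_request()))
```

Existing .monitoring indices would still have to be deleted separately, since disabling collection only stops new data from being written.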

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

Unfortunately, I can't close any indices, since they could be needed at any time by our application.

I'm having a hard time making 100% sure that the crashes are caused by the oversharding and not by something else.

Here are the outputs you requested:

GET / = GET / · GitHub
GET /_cat/nodes?v = /_cat/nodes?v · GitHub
GET /_cat/health?v = /_cat/health?v · GitHub
GET /_cat/indices?v = GET /_cat/indices?v · GitHub

Could you share the unredacted index names for all the indices that you are not using yourself? Please update your gist with that information.

I updated the gist and left the names of the indices that our app doesn't use unredacted.

You have a lot of indices with 5 or 2 shards. Just use one shard. Unless you have more than 20 GB of data in an index, one shard is enough.

As you have only one node, you should set the number of replicas to 0. It does not make sense to ask for replicas if you don't have other nodes to replicate the data to. That won't change much, but it will clean things up a bit.
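A minimal sketch of the two settings bodies this implies, again only building requests without sending them. The index names are hypothetical. Note that `number_of_replicas` can be changed on a live index, whereas `number_of_shards` is fixed at creation, so existing indices would need to be reindexed or shrunk to end up with one shard.

```python
import json


def zero_replicas_request(index_pattern):
    """Update index settings so that no replica shards are requested."""
    body = {"index": {"number_of_replicas": 0}}
    return ("PUT", f"/{index_pattern}/_settings", body)


def single_shard_index_request(index):
    """Create a new index with one primary shard and no replicas."""
    body = {"settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}}}
    return ("PUT", f"/{index}", body)


for method, path, body in (
    zero_replicas_request("customer-*"),        # hypothetical index pattern
    single_shard_index_request("customer-42"),  # hypothetical index name
):
    print(method, path)
    print(json.dumps(body, indent=2))
```

With 77 indices at one primary shard each, the node would hold 77 shards instead of 175.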

You have a lot of indices for very few documents. Is there a reason for this? I can't tell from the names (because they're redacted) whether they are time-based indices or not.
Could you tell us more about the use case, or why you have so many tiny indices?

I don't have full visibility into what the devs are doing.

I think that for a particular feature they create one index per customer due to some constraints, so the number of indices will increase over time, but they won't all hold the same number of documents (it depends on the usage).

The indices are unrelated (no rollover/time-based) and handled (created) by the application.

You can't expect miracles with so few resources if you don't have any control over how they are used.

One index per customer could be a bad idea, especially for small datasets and small heaps.
Using filtered aliases might be better.
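To illustrate what a filtered alias looks like, here is a sketch that builds the body for the aliases API. The shared index name `customers`, the `customer_id` field and the alias naming scheme are all assumptions, since the actual mapping is not known here.

```python
import json


def filtered_alias_action(index, alias, field, value):
    """One 'add' action: an alias that only exposes documents matching a term filter."""
    return {
        "add": {
            "index": index,
            "alias": alias,
            "filter": {"term": {field: value}},
        }
    }


def aliases_request(actions):
    """Request for the aliases API that applies all actions together."""
    return ("POST", "/_aliases", {"actions": actions})


# Hypothetical layout: one shared index, one filtered alias per customer.
actions = [
    filtered_alias_action("customers", f"customer-{cid}", "customer_id", cid)
    for cid in ("acme", "globex")
]
method, path, body = aliases_request(actions)
print(method, path)
print(json.dumps(body, indent=2))
```

The application would then search `customer-acme` as if it were a dedicated index, while the cluster only carries the shards of the shared index.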

But again, I don't know the use case.

Coming back to one of your original questions:

You can't have both:

  • no control of what devs are doing (including oversharding)
  • no control of the size of RAM/HEAP available