ECE Weird Full Restart Deployment

Hi,
I ran into a full restart deployment when I scaled my cluster from 3 nodes to 6 nodes.

This is very strange and a bit scary, because a full restart deployment means the cluster is unavailable to users during that time.

While this plan was running we kept writing documents to the cluster, and I noticed that one node was under high heap pressure.

So why does ECE do a full restart deployment even though I chose a rolling strategy?

My plan details are as follows:

    {
      "tiebreaker_topology": {
        "memory_per_node": 1024
      },
      "elasticsearch": {
        "version": "5.6.14",
        "system_settings": {
          "use_disk_threshold": true
        }
      },
      "transient": {
        "strategy": {
          "rolling": {
            "group_by": "__all__"
          }
        },
        "plan_configuration": {
          "preferred_allocators": [],
          "max_snapshot_attempts": 3,
          "move_allocators": [],
          "skip_snapshot": true,
          "move_instances": [],
          "skip_post_upgrade_steps": false,
          "cluster_reboot": "forced",
          "extended_maintenance": false,
          "skip_upgrade_checker": false,
          "override_failsafe": false,
          "skip_data_migration": false,
          "calm_wait_time": 5,
          "reallocate_instances": false,
          "timeout": 4096,
          "move_only": false
        }
      },
      "cluster_topology": [
        {
          "memory_per_node": 1024,
          "node_type": {
            "master": true,
            "data": true,
            "ingest": true,
            "ml": false
          },
          "instance_configuration_id": "c3fd8f77dffe421897f870339dc13015",
          "elasticsearch": {
            "system_settings": {
              "enable_close_index": false,
              "use_disk_threshold": true,
              "monitoring_collection_interval": -1,
              "monitoring_history_duration": "7d",
              "destructive_requires_name": false,
              "reindex_whitelist": [],
              "auto_create_index": true,
              "watcher_trigger_engine": "scheduler",
              "scripting": {
                "inline": {
                  "enabled": true,
                  "sandbox_mode": true
                },
                "expressions_enabled": true,
                "stored": {
                  "enabled": true,
                  "sandbox_mode": true
                },
                "file": {
                  "enabled": true,
                  "sandbox_mode": true
                },
                "mustache_enabled": true,
                "painless_enabled": true
              },
              "http": {
                "compression": true,
                "cors_enabled": false,
                "cors_max_age": 1728000,
                "cors_allow_credentials": false
              }
            },
            "enabled_built_in_plugins": [],
            "user_plugins": [],
            "user_bundles": []
          },
          "zone_count": 3,
          "node_count_per_zone": 2
        }
      ],
      "deployment_template": {
        "id": "4a17673d984b462ba8a2cbb8875bf4d9"
      }
    }

And I found that the node with high memory pressure exited with the following container logs.

    usermod: no changes
    *** Running /etc/my_init.d/00_regen_ssh_host_keys.sh...
    *** Running /etc/rc.local...
    *** Running setuser founduser /app/elasticsearch.sh...
    2019-02-26T04:30:14+0000 Booting at Tue Feb 26 04:30:14 UTC 2019
    2019-02-26T04:30:14+0000 Enabling QuotaAwareFileSystemProvider
    2019-02-26T04:30:14+0000 Installing user plugins.
    2019-02-26T04:30:14+0000 Installing user bundles.
    2019-02-26T04:30:14+0000 No user bundles defined
    2019-02-26T04:30:14+0000 Done installing plugins and bundles, verifying required config files.
    2019-02-26T04:30:14+0000 Done verifying required config files, starting Elasticsearch. See Elasticsearch logs for further output.
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    /app/elasticsearch.sh: line 167:    29 Killed                  /elasticsearch/bin/elasticsearch -p /app/es.pid $CONFIG_OPTIONS $*
    Elasticsearch exited with status 137, running ./on-error-exitcode.sh 137
    + exec
    *** setuser exited with status 137.
    *** Killing all processes..

As far as I know, exit status 137 means the process was killed by the OOM killer.

But if the ES JVM heap has been set to less than half of the container memory, why does an OOM kill still happen?
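
For reference, exit status 137 is 128 + 9, i.e. the process received SIGKILL, which is the signal the kernel OOM killer sends. A quick, purely illustrative shell check:

    sleep 300 &    # start a throwaway background process
    kill -9 $!     # send it SIGKILL, as the OOM killer would
    wait $!        # wait picks up the exit status of the killed job
    echo $?        # prints 137 = 128 + 9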

I found the reason.

I did a full restart from the UI earlier, and later I deployed a new plan based on that manual advanced configuration, forgetting to delete the `"cluster_reboot": "forced"` setting shown in the plan above.

:joy:
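
For anyone hitting the same thing, this is roughly what the `transient` section could look like once the leftover forced reboot is removed, so that the rolling strategy actually takes effect. This is only a trimmed sketch based on the plan above, not a complete or authoritative plan:

    "transient": {
      "strategy": {
        "rolling": {
          "group_by": "__all__"
        }
      },
      "plan_configuration": {
        "max_snapshot_attempts": 3,
        "skip_snapshot": true,
        "calm_wait_time": 5,
        "timeout": 4096
      }
    }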

> As far as I know, exit status 137 means the process was killed by the OOM killer.
>
> But if the ES JVM heap has been set to less than half of the container memory, why does an OOM kill still happen?

We have a script that detects when ES is so busy GC'ing that it can't do anything, and we restart it in that case (using the same return code as the system OOM killer).
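
Just to illustrate the idea (this is not the actual ECE script; the URL, threshold, polling interval, and use of `jq` below are assumptions, with the pid file path taken from the container logs above), such a watchdog could look roughly like this:

    #!/usr/bin/env bash
    # Hypothetical sketch, NOT the real ECE watchdog: if old-gen GC eats most of
    # a one-minute window, assume the node is stuck in GC and SIGKILL it, so the
    # container exits with status 137 just like a kernel OOM kill would cause.
    ES_URL="http://localhost:9200"   # assumption: unauthenticated local node
    ES_PID_FILE="/app/es.pid"        # pid file used in the container logs above
    THRESHOLD_MS=30000               # assumption: 30s of old-gen GC per minute

    old_gc_ms() {
      curl -s "$ES_URL/_nodes/_local/stats/jvm" |
        jq '[.nodes[].jvm.gc.collectors.old.collection_time_in_millis] | first'
    }

    before=$(old_gc_ms)
    sleep 60
    after=$(old_gc_ms)

    if [ $((after - before)) -gt "$THRESHOLD_MS" ]; then
      echo "old-gen GC took $((after - before))ms in the last minute, killing ES"
      kill -9 "$(cat "$ES_PID_FILE")"   # exits with status 137 = 128 + SIGKILL
    fi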
