Out of memory issue on one node caused cluster failure?

Hi,

Came in this morning to find our pre-prod environment has spat out its dummy over the weekend... just in time for our planned go-live on Thursday!

It would appear that we lost one of our three nodes, but the rest of the nodes were no longer available either (the cluster went to red status, showing no shards available).

We are running 3x boxes on AWS (Linux). There's usually only around 300-400MB of data in the index, so it should be a walk in the park.

So, a two-pronged question...

  1. Are we able to pin down why one server had the OOM issue? (Truncated logs below; any pointers on this appreciated.)
  2. Am I missing a setting that would force a master re-election?

Thanks

Phil

[2017-11-24T19:02:11,932][WARN ][o.e.j.s.ServletHandler   ] Error for /availability%2f768/_bulk_docs
java.lang.OutOfMemoryError: Java heap space
.
.
[2017-11-24T19:04:48,579][WARN ][o.e.t.n.Netty4Transport  ] [node-2] exception caught on transport layer [[id: 0x95249d98, L:/172.31.11.105:9300 - R:/172.31.8.125:45032]], closing connection
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Java heap space
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:269) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.11.Final.jar:4.1.11.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.lang.OutOfMemoryError: Java heap space

What is the full output of the cluster stats API?
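
You can get it from any of the nodes with something like this (assuming the default HTTP port of 9200; swap localhost for the node's bound address if needed):

curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'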

Running cluster health (GET _cluster/health) looks like this:

{
    "cluster_name": "Production",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 3,
    "number_of_data_nodes": 3,
    "active_primary_shards": 1,
    "active_shards": 1,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100
}

The config on each of the nodes is this (except for machine-specific items):

# ---------------------------------- Cluster  -----------------------------------
cluster.name: Production
# ------------------------------------ Node  ------------------------------------
node.name: node-2
# ----------------------------------- Memory  -----------------------------------
bootstrap.memory_lock: true
# ---------------------------------- Network  -----------------------------------
network.host: _ec2:privateIpv4_
http.port: 9200
# --------------------------------- Discovery  ----------------------------------
cloud.aws.access_key: [Key was here]
cloud.aws.secret_key: [Secret was here]
cloud.aws.protocol: https
cloud.aws.ec2.protocol: https
cloud.aws.region: eu-west
cloud.aws.read_timeout: 30s
discovery.ec2.host_type: private_ip
discovery.ec2.tag.elastic: cluster1
discovery.zen.hosts_provider: ec2
discovery.zen.ping.unicast.hosts: ["172.xx.x.xxx", "172.xx.xx.xxx", "172.xx.x.xxx"]
discovery.zen.minimum_master_nodes: 2
# ---------------------------------- Various -----------------------------------
action.destructive_requires_name: true
# --------------------------------- Couchbase ----------------------------------
couchbase.port: 9091
couchbase.username: [username was here]
couchbase.password: [password was here]
# ------------------------------------------------------------------------------

Hi Christian,

Sorry, that last reply crossed with yours.

Output from the stats API here:

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "Production",
    "timestamp": 1511784102432,
    "status": "green",
    "indices": {
        "count": 1,
        "shards": {
            "total": 1,
            "primaries": 1,
            "replication": 0,
            "index": {
                "shards": {
                    "min": 1,
                    "max": 1,
                    "avg": 1
                },
                "primaries": {
                    "min": 1,
                    "max": 1,
                    "avg": 1
                },
                "replication": {
                    "min": 0,
                    "max": 0,
                    "avg": 0
                }
            }
        },
        "docs": {
            "count": 1699222,
            "deleted": 128187
        },
        "store": {
            "size": "889.5mb",
            "size_in_bytes": 932749664,
            "throttle_time": "0s",
            "throttle_time_in_millis": 0
        },
        "fielddata": {
            "memory_size": "7.3kb",
            "memory_size_in_bytes": 7504,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "15.5mb",
            "memory_size_in_bytes": 16277768,
            "total_count": 61722,
            "hit_count": 14284,
            "miss_count": 47438,
            "cache_size": 559,
            "cache_count": 602,
            "evictions": 43
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 16,
            "memory": "2.7mb",
            "memory_in_bytes": 2843378,
            "terms_memory": "1.7mb",
            "terms_memory_in_bytes": 1789631,
            "stored_fields_memory": "67.8kb",
            "stored_fields_memory_in_bytes": 69432,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "137.1kb",
            "norms_memory_in_bytes": 140480,
            "points_memory": "102.1kb",
            "points_memory_in_bytes": 104651,
            "doc_values_memory": "721.8kb",
            "doc_values_memory_in_bytes": 739184,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "447.5kb",
            "fixed_bit_set_memory_in_bytes": 458256,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 3,
            "data": 3,
            "coordinating_only": 0,
            "master": 3,
            "ingest": 3
        },
        "versions": [
            "5.5.2"
        ],
        "os": {
            "available_processors": 12,
            "allocated_processors": 12,
            "names": [
                {
                    "name": "Linux",
                    "count": 3
                }
            ],
            "mem": {
                "total": "21.9gb",
                "total_in_bytes": 23533596672,
                "free": "7.8gb",
                "free_in_bytes": 8423940096,
                "used": "14gb",
                "used_in_bytes": 15109656576,
                "free_percent": 36,
                "used_percent": 64
            }
        },
        "process": {
            "cpu": {
                "percent": 3
            },
            "open_file_descriptors": {
                "min": 258,
                "max": 260,
                "avg": 259
            }
        },
        "jvm": {
            "max_uptime": "17.9d",
            "max_uptime_in_millis": 1553499938,
            "versions": [
                {
                    "version": "1.8.0_141",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "25.141-b16",
                    "vm_vendor": "Oracle Corporation",
                    "count": 3
                }
            ],
            "mem": {
                "heap_used": "2gb",
                "heap_used_in_bytes": 2206968472,
                "heap_max": "5.9gb",
                "heap_max_in_bytes": 6337855488
            },
            "threads": 213
        },
        "fs": {
            "total": "88.2gb",
            "total_in_bytes": 94711504896,
            "free": "68.8gb",
            "free_in_bytes": 73932374016,
            "available": "68.5gb",
            "available_in_bytes": 73624412160
        },
        "plugins": [
            {
                "name": "discovery-ec2",
                "version": "5.5.2",
                "description": "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
                "classname": "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
                "has_native_controller": false
            },
            {
                "name": "transport-couchbase",
                "version": "2.5.5.2",
                "description": "Couchbase to Elasticsearch Transport",
                "classname": "org.elasticsearch.plugin.transport.couchbase.CouchbaseCAPITransportPlugin",
                "has_native_controller": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 3
            },
            "http_types": {
                "netty4": 3
            }
        }
    }
}

It looks like you have a 3-node cluster with only a single shard. This means that only one node holds data. I would recommend at least adding one or two replica shards, and perhaps also changing to 3 primary shards so you can utilise your hardware better (a sketch of the settings change is below).
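
Adding replicas is just a dynamic settings update on the existing index; a minimal sketch (the index name here is a placeholder for your own):

PUT /your_index/_settings
{
    "index": {
        "number_of_replicas": 2
    }
}

Changing the number of primary shards is a bigger job on 5.x, as it means creating a new index with the desired shard count and reindexing into it.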

It seems like you have a good amount of available heap now, so I am not sure what caused the OOM. Are you sending very large bulk requests or perhaps running very heavy aggregations?
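
If it does turn out that the heap is simply too small for the bulk load, it can be raised in config/jvm.options on each node (keeping it at or below half of the machine's RAM), for example:

-Xms4g
-Xmx4g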

I also do not have any experience with the transport-couchbase plugin, so I do not know whether there are any issues related to it that could have contributed.

I would recommend you install X-Pack monitoring so that it is easier to troubleshoot these kinds of issues (install commands sketched below).
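
On 5.x X-Pack is installed as a plugin on each node (and in Kibana), so assuming a standard install layout it is roughly:

bin/elasticsearch-plugin install x-pack
bin/kibana-plugin install x-pack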

Thanks Christian,

It's possible that the Couchbase plugin did send bulk requests, as it is basically a Couchbase-to-Elasticsearch replication plugin.

Am I right in saying that with 3 primary shards a query is distributed across the shards, whereas with replica shards a single server handles the request?

If I'm correct, the extra primary shards probably won't work for us, as we're utilising facet counts and sorting, which I've read on the ES site are prone to inconsistencies when distributed.

Regards

Phil

For each query, either the primary or a replica copy of each shard needs to be queried. Replica shards are copies of the primary and are what allow your cluster to continue working if a node goes down.

Certain types of aggregations are approximations when used across multiple shards if the data volumes are large enough. If you want to avoid this and you have a small amount of data, use a single primary shard with 2 replica shards configured. This gives you better resilience and means that all nodes can respond to queries (see the sketch below).
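
If you end up recreating the index, the settings for that layout (index name again a placeholder) would look something like:

PUT /your_index
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 2
    }
}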

Many thanks..

Will give this a try now :smiley:

Regards

Phil

Hi,

The API output now looks like this:

{
    "_nodes": {
        "total": 3,
        "successful": 3,
        "failed": 0
    },
    "cluster_name": "Production",
    "timestamp": 1511791569016,
    "status": "green",
    "indices": {
        "count": 2,
        "shards": {
            "total": 13,
            "primaries": 6,
            "replication": 1.1666666666666667,
            "index": {
                "shards": {
                    "min": 3,
                    "max": 10,
                    "avg": 6.5
                },
                "primaries": {
                    "min": 1,
                    "max": 5,
                    "avg": 3
                },
                "replication": {
                    "min": 1,
                    "max": 2,
                    "avg": 1.5
                }
            }
        },
        "docs": {
            "count": 1699222,
            "deleted": 128187
        },
        "store": {
            "size": "2.6gb",
            "size_in_bytes": 2798250610,
            "throttle_time": "0s",
            "throttle_time_in_millis": 0
        },
        "fielddata": {
            "memory_size": "7.3kb",
            "memory_size_in_bytes": 7504,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "24.7mb",
            "memory_size_in_bytes": 25989608,
            "total_count": 162897,
            "hit_count": 38493,
            "miss_count": 124404,
            "cache_size": 929,
            "cache_count": 972,
            "evictions": 43
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 48,
            "memory": "8.1mb",
            "memory_in_bytes": 8527542,
            "terms_memory": "5.1mb",
            "terms_memory_in_bytes": 5368893,
            "stored_fields_memory": "203.4kb",
            "stored_fields_memory_in_bytes": 208296,
            "term_vectors_memory": "0b",
            "term_vectors_memory_in_bytes": 0,
            "norms_memory": "411.5kb",
            "norms_memory_in_bytes": 421440,
            "points_memory": "306.5kb",
            "points_memory_in_bytes": 313953,
            "doc_values_memory": "2.1mb",
            "doc_values_memory_in_bytes": 2214960,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "1.3mb",
            "fixed_bit_set_memory_in_bytes": 1374768,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 3,
            "data": 3,
            "coordinating_only": 0,
            "master": 3,
            "ingest": 3
        },
        "versions": [
            "5.5.2"
        ],
        "os": {
            "available_processors": 12,
            "allocated_processors": 12,
            "names": [
                {
                    "name": "Linux",
                    "count": 3
                }
            ],
            "mem": {
                "total": "21.9gb",
                "total_in_bytes": 23533596672,
                "free": "5.9gb",
                "free_in_bytes": 6385864704,
                "used": "15.9gb",
                "used_in_bytes": 17147731968,
                "free_percent": 27,
                "used_percent": 73
            }
        },
        "process": {
            "cpu": {
                "percent": 0
            },
            "open_file_descriptors": {
                "min": 264,
                "max": 269,
                "avg": 266
            }
        },
        "jvm": {
            "max_uptime": "18d",
            "max_uptime_in_millis": 1560966523,
            "versions": [
                {
                    "version": "1.8.0_141",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "25.141-b16",
                    "vm_vendor": "Oracle Corporation",
                    "count": 3
                }
            ],
            "mem": {
                "heap_used": "1.7gb",
                "heap_used_in_bytes": 1896065120,
                "heap_max": "5.9gb",
                "heap_max_in_bytes": 6337855488
            },
            "threads": 223
        },
        "fs": {
            "total": "88.2gb",
            "total_in_bytes": 94711504896,
            "free": "67.1gb",
            "free_in_bytes": 72065830912,
            "available": "66.8gb",
            "available_in_bytes": 71757869056
        },
        "plugins": [
            {
                "name": "discovery-ec2",
                "version": "5.5.2",
                "description": "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
                "classname": "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
                "has_native_controller": false
            },
            {
                "name": "transport-couchbase",
                "version": "2.5.5.2",
                "description": "Couchbase to Elasticsearch Transport",
                "classname": "org.elasticsearch.plugin.transport.couchbase.CouchbaseCAPITransportPlugin",
                "has_native_controller": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 3
            },
            "http_types": {
                "netty4": 3
            }
        }
    }
}

I'm assuming the jump to 13 shards is just the replica (failover) copies being counted?

thanks

You can use the cat shards API to see what shards you have in the cluster.
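
For example, against any of the nodes (assuming the default port of 9200):

curl -XGET 'http://localhost:9200/_cat/shards?v'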

Sorted, thanks...

I had inadvertently created a 2nd index!

It now reads 3 shards (1 primary and 2 replicas).

Many thanks for your help.

Regards

Phil

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.