Out of memory on data nodes

Hi,

I have a cluster with 5 nodes: 3 master nodes and 2 data nodes.
The data nodes have 24 GB of RAM each, with a 12 GB heap (Xmx).
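For reference, a heap of that size is normally configured in config/jvm.options; this is only a minimal sketch matching the sizes described above (with Xms set equal to Xmx, as generally recommended):

# config/jvm.options on the data nodes (illustrative values)
-Xms12g
-Xmx12g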

I use daily indices that are created automatically at midnight. Each index is about 30 GB.
Each index has 5 shards and 1 replica.
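For context, daily indices with those settings are usually driven by an index template; a minimal sketch, assuming a hypothetical template name and index pattern:

PUT _template/daily_logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

With such a template in place, the first document indexed after midnight into the new day's index creates it automatically with 5 primaries and 1 replica.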

One data node receives bulk data from the producer; the second node holds the replicas, and Kibana is also connected to that second node.
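For illustration, the producer sends documents through the bulk API; a minimal sketch of one such request (index name and fields are hypothetical):

POST /_bulk
{ "index" : { "_index" : "logs-2018.01.26", "_type" : "doc" } }
{ "@timestamp" : "2018-01-26T00:00:01Z", "message" : "example event" }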

Today at midnight I got an OutOfMemoryError on both data nodes.

Node 1 is the replica node.
Node 2 is the data receiver.

Logs:
Node 2:

[2018-01-26T00:00:14,648][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:89)
	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
	at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
	at

[2018-01-26T00:00:21,306][WARN ][o.e.m.j.JvmGcMonitorService] [kib-vm06-data-2-es_data_instance] [gc][4244281] overhead, spent [3.3s] collecting in the last [3.5s]
[2018-01-26T00:00:24,358][WARN ][o.e.m.j.JvmGcMonitorService] [kib-vm06-data-2-es_data_instance] [gc][4244282] overhead, spent [3s] collecting in the last [3s]
[2018-01-26T00:00:24,360][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [kib-vm06-data-2-es_data_instance] fatal error in thread [elasticsearch[kib-vm06-data-2-es_data_instance][bulk][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
[2018-01-26T00:00:24,361][ERROR][o.e.i.e.Engine           ] [kib-vm06-data-2-es_data_instance] [.monitoring-es-6-2018.01.25][0] tragic event in index writer
java.lang.OutOfMemoryError: Java heap space
[2018-01-26T00:00:24,362][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [kib-vm06-data-2-es_data_instance] fatal error in thread [elasticsearch[kib-vm06-data-2-es_data_instance][bulk][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
[2018-01-26T00:00:17,793][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [kib-vm06-data-2-es_data_instance] fatal error in thread [elasticsearch[kib-vm06-data-2-es_data_instance][search][T#3]], exiting
java.lang.OutOfMemoryError: Java heap space
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:117) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigByteArray.resize(BigByteArray.java:143) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.resizeInPlace(BigArrays.java:449) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:496) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:513) ~[elasticsearch-6.0.1.jar:6.0.1]

Node 1:

[2018-01-26T00:00:15,215][ERROR][o.e.x.m.c.n.NodeStatsCollector] [kib-vm05-data-1-es_data_instance] collector [node_stats] timed out when collecting data
[2018-01-26T00:00:20,967][WARN ][o.e.m.j.JvmGcMonitorService] [kib-vm05-data-1-es_data_instance] [gc][3224345] overhead, spent [22.7s] collecting in the last [22.7s]
[2018-01-26T00:00:30,672][WARN ][o.e.m.j.JvmGcMonitorService] [kib-vm05-data-1-es_data_instance] [gc][3224346] overhead, spent [9.6s] collecting in the last [9.7s]
[2018-01-26T00:00:30,678][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [kib-vm05-data-1-es_data_instance] fatal error in thread [elasticsearch[kib-vm05-data-1-es_data_instance][search][T#4]], exiting
java.lang.OutOfMemoryError: Java heap space
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:117) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigByteArray.resize(BigByteArray.java:143) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.resizeInPlace(BigArrays.java:449) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:496) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:513) ~[elasticsearch-6.0.1.jar:6.0.1]
	at org.elasticsearch.search.aggregations.metrics.cardinality.HyperLogLogPlusPlus.ensureCapacity(HyperLogLogPlusPlus.java:202) ~[elasticsearch-6.0.1.jar:6.0.1]
	at

Which version are you using? What is the full output of the cluster stats API?

Version is 6.0.1

Unfortunately I don't have a stats snapshot from right after the failure. The nodes were restarted and everything is OK now.

Current stat here: https://www.dropbox.com/s/b0d71b9cife6x48/stat.txt?dl=0

That seems to be output from the cluster state API, not the cluster stats API.
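For reference, they are two different endpoints (host and port here are just the usual defaults, for illustration):

curl -XGET 'http://localhost:9200/_cluster/state?pretty'        # cluster state: routing table, metadata, mappings
curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'   # cluster stats: node counts, heap, shard and segment statistics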

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "production-cluster",
  "timestamp" : 1516957551458,
  "status" : "green",
  "indices" : {
    "count" : 269,
    "shards" : {
      "total" : 1000,
      "primaries" : 500,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 10,
          "avg" : 3.717472118959108
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.858736059479554
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 5605366880,
      "deleted" : 157973
    },
    "store" : {
      "size" : "3.8tb",
      "size_in_bytes" : 4215510892243
    },
    "fielddata" : {
      "memory_size" : "42.5mb",
      "memory_size_in_bytes" : 44582216,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "127.4mb",
      "memory_size_in_bytes" : 133641386,
      "total_count" : 475495,
      "hit_count" : 165967,
      "miss_count" : 309528,
      "cache_size" : 18026,
      "cache_count" : 22676,
      "evictions" : 4650
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1832,
      "memory" : "5.6gb",
      "memory_in_bytes" : 6084316454,
      "terms_memory" : "3.4gb",
      "terms_memory_in_bytes" : 3659080227,
      "stored_fields_memory" : "1.5gb",
      "stored_fields_memory_in_bytes" : 1646654328,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "110.6kb",
      "norms_memory_in_bytes" : 113280,
      "points_memory" : "728.4mb",
      "points_memory_in_bytes" : 763829035,
      "doc_values_memory" : "13.9mb",
      "doc_values_memory_in_bytes" : 14639584,
      "index_writer_memory" : "8.2mb",
      "index_writer_memory_in_bytes" : 8647240,
      "version_map_memory" : "260b",
      "version_map_memory_in_bytes" : 260,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1516948550419,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "data" : 2,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 5
    },
    "versions" : [
      "6.0.1"
    ],
    "os" : {
      "available_processors" : 20,
      "allocated_processors" : 20,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "mem" : {
        "total" : "58.5gb",
        "total_in_bytes" : 62913048576,
        "free" : "2.2gb",
        "free_in_bytes" : 2371362816,
        "used" : "56.3gb",
        "used_in_bytes" : 60541685760,
        "free_percent" : 4,
        "used_percent" : 96
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 11
      },
      "open_file_descriptors" : {
        "min" : 290,
        "max" : 1374,
        "avg" : 723
      }
    },
    "jvm" : {
      "max_uptime" : "49.7d",
      "max_uptime_in_millis" : 4299700186,
      "versions" : [
        {
          "version" : "1.8.0_151",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.151-b12",
          "vm_vendor" : "Oracle Corporation",
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used" : "9.1gb",
        "heap_used_in_bytes" : 9844503472,
        "heap_max" : "29.8gb",
        "heap_max_in_bytes" : 32037928960
      },
      "threads" : 285
    },
    "fs" : {
      "total" : "7.8tb",
      "total_in_bytes" : 8596162236416,
      "free" : "3.9tb",
      "free_in_bytes" : 4371546157056,
      "available" : "3.9tb",
      "available_in_bytes" : 4371546157056
    },
    "plugins" : [
      {
        "name" : "x-pack",
        "version" : "6.0.1",
        "description" : "Elasticsearch Expanded Pack Plugin",
        "classname" : "org.elasticsearch.xpack.XPackPlugin",
        "has_native_controller" : true,
        "requires_keystore" : true
      }
    ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 5
      },
      "http_types" : {
        "netty4" : 5
      }
    }
  }
}
