Out of Memory on 2.3.1 (cluster id 14de95)

Hi,

I am in the process of upgrading our Elastic Cloud clusters from 1.7.3 to 2.3.1.

For testing, I have created a new cluster (ID 14de95) running 2.3.1. It was set up with the same settings and mappings as our current non-PROD cluster (ID bf0245).

However, I have been having some issues while indexing all of our data into the 2.3.1 instance. Whenever my application triggers a full reindex into Elasticsearch 2.3.1, the cluster runs out of memory, normally after about an hour of indexing. The same process works fine against 1.7.3. I have tried increasing the RAM available to the cluster (from 4 GB to 8 GB) and increasing the refresh_interval from the default to 30s, but neither seemed to help. (4 GB and the default refresh_interval work without any noticeable issues on our 1.7.3 cluster.)
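
In case the exact change matters, the refresh_interval was applied through the index settings API, roughly like this (the host is illustrative; in practice it is the Elastic Cloud endpoint):

    # Raise the refresh interval before the bulk load (value as described above)
    curl -XPUT 'http://localhost:9200/myindex-dev/_settings' -d '{
      "index" : {
        "refresh_interval" : "30s"
      }
    }'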

Monitoring the node with the stats API, I noticed that the segments.index_writer_memory value was growing far more than I'd expect. During a complete index I saw this value reach 2.2 GB on Elasticsearch 2.3.1 just before the OutOfMemoryError; the details are below. On Elasticsearch 1.7.3 I never saw this value go above 11 MB.

    "segments" : {
      "count" : 30,
      "memory" : "17.4mb",
      "memory_in_bytes" : 18259186,
      "terms_memory" : "17.1mb",
      "terms_memory_in_bytes" : 17987986,
      "stored_fields_memory" : "35kb",
      "stored_fields_memory_in_bytes" : 35896,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "63.8kb",
      "norms_memory_in_bytes" : 65408,
      "doc_values_memory" : "165.9kb",
      "doc_values_memory_in_bytes" : 169896,
      "index_writer_memory" : "2.2gb",
      "index_writer_memory_in_bytes" : 2419813928,
      "index_writer_max_memory" : "395.9mb",
      "index_writer_max_memory_in_bytes" : 415183667,
      "version_map_memory" : "91.2kb",
      "version_map_memory_in_bytes" : 93477,
      "fixed_bit_set" : "13kb",
      "fixed_bit_set_memory_in_bytes" : 13336
    },
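
For reference, the segments block above comes from the stats API; the calls I'm running are along these lines (host is illustrative, it's the Elastic Cloud endpoint in practice):

    # Per-node view of segment memory, including index_writer_memory (host illustrative)
    curl -XGET 'http://localhost:9200/_nodes/stats/indices/segments?pretty'

    # Equivalent per-index view
    curl -XGET 'http://localhost:9200/myindex-dev/_stats/segments?pretty'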

I'd have thought that Elasticsearch would flush automatically and that this value would come back down without any intervention from me.

I'd be very grateful if someone could point me towards what might be causing this.

Thanks,
James

Hi @James_Faulkner,

I've moved the post to the Elasticsearch category, as I think you'll get better help there. It looks like the issue is related to Elasticsearch 2.3.1.

Thanks,
Igor

It is unexpected for segments.index_writer_memory to reach such a high value with the default configuration. Do you happen to have non-default values for any settings? Is there anything unusual about your setup, such as a large number of indices?

Hi @jpountz

Thanks for the response.

There's nothing particularly strange about our settings...

{
  "myindex-dev" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "30s",
        "number_of_shards" : "1",
        "creation_date" : "1460718140781",
        "analysis" : {
          "filter" : {
            "meal_synonyms" : {
              "type" : "synonym",
              "synonyms" : [ "prawn,shrimp", "eggplant,aubergine", "fries,chips", "zucchini,courgette" ]
            },
            "no_stem" : {
              "keywords" : [ "fries" ],
              "type" : "keyword_marker"
            }
          },
          "analyzer" : {
            "string_lowercase" : {
              "filter" : "lowercase",
              "tokenizer" : "keyword"
            },
            "default" : {
              "filter" : [ "lowercase", "stop", "meal_synonyms", "no_stem", "kstem" ],
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "yr_YaiAITEi-nZmGdZsPoA",
        "version" : {
          "created" : "2030199"
        }
      }
    }
  }
}

Other than the refresh_interval being increased to 30s, these are the same settings that we are using on 1.7.3 without any issues.

Cheers,
James

OK, so you did not touch the node-level settings indices.memory.index_buffer_size or indices.memory.min_index_buffer_size? By default, segments.index_writer_memory should never be larger than 10% of the JVM heap, which is ~410 MB on a 4 GB heap.

Do you have any warnings in the logs?
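
If it helps to double-check, the effective node settings can be read back with the nodes info API, along these lines (host is illustrative):

    # List node settings and look for any indices.memory.* overrides (host illustrative)
    curl -XGET 'http://localhost:9200/_nodes/settings?pretty'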

There are a lot of warnings coming from monitor.jvm like this:

[2016-04-15 12:47:12,694][WARN ][monitor.jvm ] [gc][young][25][2] duration [1s], collections [1]/[2.1s], total [1s]/[1.3s], memory [1gb]->[123.8mb]/[3.8gb], all_pools {[young] [1gb]->[40.4mb]/[1gb]}{[survivor] [47.8mb]->[83.4mb]/[136.5mb]}{[old] [0b]->[0b]/[2.6gb]}

There are also a number of timeouts like this one:

[2016-04-15 12:48:01,989][WARN ][rest.suppressed ] /myindex-dev/mealmetadata/6019034233438208/_update Params: {index=myindex-dev, id=6019034233438208, type=meal} UnavailableShardsException[[marketplace-dev][0] [1] shardIt, [0] active : Timeout waiting for [1m], request: indices:data/write/update]
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction.retry(TransportInstanceSingleOperationAction.java:213)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction.doStart(TransportInstanceSingleOperationAction.java:168)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$2.onTimeout(TransportInstanceSingleOperationAction.java:234)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

However, these all appear to be happening a while after the full index process has started. I haven't spotted any warnings in the logs from before the indexing kicks in.

@jpountz

I attempted this again on a new cluster this morning and noticed these warnings, though I'm not sure that they are related.

[2016-04-19 08:38:49,956][WARN ][com.amazonaws.jmx.SdkMBeanRegistrySupport] java.security.AccessControlException: access denied ("javax.management.MBeanServerPermission" "findMBeanServer")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
at java.security.AccessController.checkPermission(AccessController.java:884)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at javax.management.MBeanServerFactory.checkPermission(MBeanServerFactory.java:413)
at javax.management.MBeanServerFactory.findMBeanServer(MBeanServerFactory.java:361)
at com.amazonaws.jmx.MBeans.getMBeanServer(MBeans.java:111)
at com.amazonaws.jmx.MBeans.registerMBean(MBeans.java:50)
at com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
at com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:355)
at com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:316)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:282)
at org.elasticsearch.cloud.aws.InternalAwsS3Service.getClient(InternalAwsS3Service.java:155)
at org.elasticsearch.cloud.aws.InternalAwsS3Service.client(InternalAwsS3Service.java:81)
at org.elasticsearch.repositories.s3.S3Repository.<init>(S3Repository.java:131)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:50)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:54)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:887)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:201)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:880)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:162)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
at org.elasticsearch.repositories.RepositoriesService.createRepositoryHolder(RepositoriesService.java:404)
at org.elasticsearch.repositories.RepositoriesService.registerRepository(RepositoriesService.java:368)
at org.elasticsearch.repositories.RepositoriesService.access$100(RepositoriesService.java:55)
at org.elasticsearch.repositories.RepositoriesService$1.execute(RepositoriesService.java:110)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I also tried calling the flush API directly, but that always results in the following...

{
  "_shards" : {
    "total" : 2,
    "successful" : 0,
    "failed" : 2,
    "failures" : [ {
      "shard" : 0,
      "index" : "myindex-dev",
      "status" : "SERVICE_UNAVAILABLE",
      "reason" : {
        "type" : "flush_not_allowed_engine_exception",
        "reason" : "already flushing...",
        "shard" : "0",
        "index" : "myindex-dev"
      }
    } ]
  }
}
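
The flush call itself is nothing special; it's along these lines (host is illustrative):

    # Manually flush the index; this is the call that returns the error above (host illustrative)
    curl -XPOST 'http://localhost:9200/myindex-dev/_flush?pretty'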

I attempted this again on a new cluster this morning and noticed these warnings, though I'm not sure that they are related.

It's not.

About this message: did you change logging.yml?
We hide this by default.

Hey @dadoonet

I haven't touched logging.yml. Our cluster is deployed on Elastic Cloud, so it is using whatever the default logging.yml is for that.

Interesting. Thanks for the information. It helps.

It looks like something is preventing Elasticsearch from writing segments, but I have no idea what. I've never seen this problem before.

I still need to do some more testing, but it appears this was being caused by this issue: https://github.com/elastic/elasticsearch/issues/16444. Setting a distance_error_pct value on our geo_shape field appears to have improved indexing performance and stopped the memory issues we were seeing.
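
For anyone who hits the same thing, the mapping change was roughly as follows. The field and type names here are illustrative rather than our real mapping, and note that changing parameters on an existing geo_shape field may require recreating the index:

    # Hypothetical geo_shape field with an explicit distance_error_pct
    # (field/type names are illustrative; host illustrative)
    curl -XPUT 'http://localhost:9200/myindex-dev/_mapping/meal' -d '{
      "properties" : {
        "delivery_area" : {
          "type" : "geo_shape",
          "precision" : "50m",
          "distance_error_pct" : 0.025
        }
      }
    }'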