Out of Memory on 2.3.1 (cluster id 14de95)

Hi,

I am in the process of upgrading our Elastic Cloud clusters from 1.7.3 to 2.3.1.

For testing, I have created a new cluster (ID 14de95) running 2.3.1. It was set up with the same settings and mappings as our current non-PROD cluster (ID bf0245).

However, I have been having some issues while indexing all of our data into the 2.3.1 instance. Whenever my application triggers a full reindex into Elasticsearch 2.3.1, the cluster runs out of memory, normally after about an hour of indexing. The same process works fine against 1.7.3. I have tried increasing the RAM available to the cluster (from 4 GB to 8 GB) and increasing the refresh_interval from the default to 30s, but neither seemed to help. (4 GB and the default refresh_interval work without any noticeable issues on our 1.7.3 cluster.)
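
In case the exact change matters, the refresh_interval was applied through the index settings API, roughly like this (the host is illustrative; in practice it is the Elastic Cloud endpoint):

    # Raise the refresh interval before the bulk load (value as described above)
    curl -XPUT 'http://localhost:9200/myindex-dev/_settings' -d '{
      "index" : {
        "refresh_interval" : "30s"
      }
    }'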

Monitoring the node with the stats API, I noticed that the segments.index_writer_memory value was growing far more than I'd expect. During a complete index I saw this value reach 2.2 GB on Elasticsearch 2.3.1 just before the OutOfMemoryError; the details are below. On Elasticsearch 1.7.3 I never saw this value go above 11 MB.

    "segments" : {
      "count" : 30,
      "memory" : "17.4mb",
      "memory_in_bytes" : 18259186,
      "terms_memory" : "17.1mb",
      "terms_memory_in_bytes" : 17987986,
      "stored_fields_memory" : "35kb",
      "stored_fields_memory_in_bytes" : 35896,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "63.8kb",
      "norms_memory_in_bytes" : 65408,
      "doc_values_memory" : "165.9kb",
      "doc_values_memory_in_bytes" : 169896,
      "index_writer_memory" : "2.2gb",
      "index_writer_memory_in_bytes" : 2419813928,
      "index_writer_max_memory" : "395.9mb",
      "index_writer_max_memory_in_bytes" : 415183667,
      "version_map_memory" : "91.2kb",
      "version_map_memory_in_bytes" : 93477,
      "fixed_bit_set" : "13kb",
      "fixed_bit_set_memory_in_bytes" : 13336
    },
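
For reference, the segments block above comes from the stats API; the calls I'm running are along these lines (host is illustrative, it's the Elastic Cloud endpoint in practice):

    # Per-node view of segment memory, including index_writer_memory (host illustrative)
    curl -XGET 'http://localhost:9200/_nodes/stats/indices/segments?pretty'

    # Equivalent per-index view
    curl -XGET 'http://localhost:9200/myindex-dev/_stats/segments?pretty'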

I'd have thought that Elasticsearch would flush automatically and that this value would come back down without any intervention from me.

I'd be very grateful if someone could point me towards what might be causing this.

Thanks,
James

Hi @James_Faulkner,

I've moved the post to the Elasticsearch category, as I think you'll get better help there. It looks like the issue is related to Elasticsearch 2.3.1.

Thanks,
Igor

It is unexpected for segments.index_writer_memory to reach such a high value with the default configuration. Do you happen to have non-default values for any settings? Is there anything unusual about your setup, such as a large number of indices?

Hi @jpountz

Thanks for the response.

There's nothing particularly strange about our settings...

{
  "myindex-dev" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "30s",
        "number_of_shards" : "1",
        "creation_date" : "1460718140781",
        "analysis" : {
          "filter" : {
            "meal_synonyms" : {
              "type" : "synonym",
              "synonyms" : [ "prawn,shrimp", "eggplant,aubergine", "fries,chips", "zucchini,courgette" ]
            },
            "no_stem" : {
              "keywords" : [ "fries" ],
              "type" : "keyword_marker"
            }
          },
          "analyzer" : {
            "string_lowercase" : {
              "filter" : "lowercase",
              "tokenizer" : "keyword"
            },
            "default" : {
              "filter" : [ "lowercase", "stop", "meal_synonyms", "no_stem", "kstem" ],
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "yr_YaiAITEi-nZmGdZsPoA",
        "version" : {
          "created" : "2030199"
        }
      }
    }
  }
}

Other than the refresh_interval being increased to 30s, these are the same settings that we are using on 1.7.3 without any issues.

Cheers,
James

OK, so you did not touch the node-level settings indices.memory.index_buffer_size or indices.memory.min_index_buffer_size? By default, segments.index_writer_memory should never be larger than 10% of the JVM heap, which is ~410 MB on a 4 GB heap.

Do you have any warnings in the logs?
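
If it helps to double-check, the effective node settings can be read back with the nodes info API, along these lines (host is illustrative):

    # List node settings and look for any indices.memory.* overrides (host illustrative)
    curl -XGET 'http://localhost:9200/_nodes/settings?pretty'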

There are a lot of warnings coming from monitor.jvm like this:

[2016-04-15 12:47:12,694][WARN ][monitor.jvm ] [gc][young][25][2] duration [1s], collections [1]/[2.1s], total [1s]/[1.3s], memory [1gb]->[123.8mb]/[3.8gb], all_pools {[young] [1gb]->[40.4mb]/[1gb]}{[survivor] [47.8mb]->[83.4mb]/[136.5mb]}{[old] [0b]->[0b]/[2.6gb]}

There are also a number of timeouts like this one:

[2016-04-15 12:48:01,989][WARN ][rest.suppressed ] /myindex-dev/mealmetadata/6019034233438208/_update Params: {index=myindex-dev, id=6019034233438208, type=meal} UnavailableShardsException[[marketplace-dev][0] [1] shardIt, [0] active : Timeout waiting for [1m], request: indices:data/write/update]
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction.retry(TransportInstanceSingleOperationAction.java:213)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction.doStart(TransportInstanceSingleOperationAction.java:168)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$2.onTimeout(TransportInstanceSingleOperationAction.java:234)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

However, these all appear to be happening a while after the full index process has started. I haven't spotted any warnings in the logs from before the indexing kicks in.

@jpountz

I attempted this again on a new cluster this morning and noticed these warnings, though I'm not sure that they are related.

[2016-04-19 08:38:49,956][WARN ][com.amazonaws.jmx.SdkMBeanRegistrySupport] java.security.AccessControlException: access denied ("javax.management.MBeanServerPermission" "findMBeanServer")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
at java.security.AccessController.checkPermission(AccessController.java:884)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at javax.management.MBeanServerFactory.checkPermission(MBeanServerFactory.java:413)
at javax.management.MBeanServerFactory.findMBeanServer(MBeanServerFactory.java:361)
at com.amazonaws.jmx.MBeans.getMBeanServer(MBeans.java:111)
at com.amazonaws.jmx.MBeans.registerMBean(MBeans.java:50)
at com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
at com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:355)
at com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:316)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:282)
at org.elasticsearch.cloud.aws.InternalAwsS3Service.getClient(InternalAwsS3Service.java:155)
at org.elasticsearch.cloud.aws.InternalAwsS3Service.client(InternalAwsS3Service.java:81)
at org.elasticsearch.repositories.s3.S3Repository.<init>(S3Repository.java:131)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:50)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:54)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:887)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:201)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:880)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:162)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
at org.elasticsearch.repositories.RepositoriesService.createRepositoryHolder(RepositoriesService.java:404)
at org.elasticsearch.repositories.RepositoriesService.registerRepository(RepositoriesService.java:368)
at org.elasticsearch.repositories.RepositoriesService.access$100(RepositoriesService.java:55)
at org.elasticsearch.repositories.RepositoriesService$1.execute(RepositoriesService.java:110)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I also tried calling the flush API directly, but that always results in the following...

{
  "_shards" : {
    "total" : 2,
    "successful" : 0,
    "failed" : 2,
    "failures" : [ {
      "shard" : 0,
      "index" : "myindex-dev",
      "status" : "SERVICE_UNAVAILABLE",
      "reason" : {
        "type" : "flush_not_allowed_engine_exception",
        "reason" : "already flushing...",
        "shard" : "0",
        "index" : "myindex-dev"
      }
    } ]
  }
}
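
The flush call itself is nothing special; it's along these lines (host is illustrative):

    # Manually flush the index; this is the call that returns the error above (host illustrative)
    curl -XPOST 'http://localhost:9200/myindex-dev/_flush?pretty'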

I attempted this again on a new cluster this morning and noticed these warnings, though I'm not sure that they are related.

It's not.

About this message: did you change logging.yml?
We hide this by default.

Hey @dadoonet

I haven't touched logging.yml. Our cluster is deployed on Elastic Cloud, so it is using whatever the default logging.yml is for that.

Interesting. Thanks for the information. It helps.

It looks like something is preventing Elasticsearch from writing segments, but I have no idea what. I've never seen this problem before.

I still need to do some more testing, but it appears this was being caused by this issue: https://github.com/elastic/elasticsearch/issues/16444. Setting a distance_error_pct value on our geo_shape field appears to have improved indexing performance and stopped the memory issues we were seeing.
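
For anyone who hits the same thing, the mapping change was roughly as follows. The field and type names here are illustrative rather than our real mapping, and note that changing parameters on an existing geo_shape field may require recreating the index:

    # Hypothetical geo_shape field with an explicit distance_error_pct
    # (field/type names are illustrative; host illustrative)
    curl -XPUT 'http://localhost:9200/myindex-dev/_mapping/meal' -d '{
      "properties" : {
        "delivery_area" : {
          "type" : "geo_shape",
          "precision" : "50m",
          "distance_error_pct" : 0.025
        }
      }
    }'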