Primary and replica shard document count inconsistency

Elasticsearch version: 2.4.0
Plugins installed: cloud-aws, hq, kopf, reindexing, whatson
JVM version: 1.8.0_92
OS version: Amazon Linux

Elasticsearch setup:
24 nodes, each running Elasticsearch with -Xms10529m -Xmx10529m
360 GB dedicated volume for Elasticsearch data on each node
elasticsearch.yml:

bootstrap.mlockall: true
cloud.aws.access_key: xxx
cloud.aws.region: us-east-1
cloud.aws.secret_key: xxx
cloud.node.auto_attributes: true
cluster.name: es-production-cluster
discovery.ec2.groups: sg-0123456
discovery.ec2.host_type: private_ip
discovery.type: ec2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
gateway.expected_nodes: 2
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.trace: 0ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.query.info: 300ms
index.search.slowlog.threshold.query.trace: 0ms
index.search.slowlog.threshold.query.warn: 500ms
index.translog.flush_threshold_size: 500mb
indices.memory.index_buffer_size: 1504m
indices.memory.max_shard_index_buffer_size: 1504m
indices.memory.min_shard_index_buffer_size: 1504m
network.host: 0.0.0.0
network.publish_host: 172.1.2.3
node.data: true
node.master: true
node.max_local_storage_nodes: 1
node.name: es-production-node1
script.indexed: true
script.inline: true
threadpool.bulk.queue_size: 150
path.conf: /etc/elasticsearch/01
path.data: /opt/elasticsearch/data/01
path.work: /tmp/elasticsearch/01
path.logs: /var/log/elasticsearch/01
path.plugins: /usr/share/elasticsearch/plugins/01

Index settings:

{
  "strings" : {
    "settings" : {
      "index" : {
        "mapping" : {
          "nested_fields" : {
            "limit" : "250"
          }
        },
        "refresh_interval" : "1s",
        "number_of_shards" : "48",
        "creation_date" : "1475183908402",
        "analysis" : {
          "analyzer" : {
            "custom_keywords_only" : {
              "filter" : "lowercase",
              "tokenizer" : "keyword"
            }
          }
        },
        "number_of_replicas" : "1",
        "version" : {
          "created" : "2040099"
        },
        "policy" : {
          "max_merge_at_once" : "2"
        }
      }
    }
  }
}

We are facing a primary/replica shard inconsistency that correlates with rolling restarts of nodes. In other words, after we restart one of the production nodes, the number of documents in the primary and the replica of one (or more) shards differs, e.g. by 100 documents. We use _cat/shards to see the difference between primary and replica document counts. The difference does not recover on its own until we re-index those documents.
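
For reference, we check the per-shard counts with a call along these lines (the host is a placeholder, and the output rows below are only illustrative of the kind of mismatch we see):

curl -s 'http://localhost:9200/_cat/shards/strings?v&h=index,shard,prirep,state,docs,node'

index   shard prirep state   docs    node
strings 17    p      STARTED 1204321 es-production-node7
strings 17    r      STARTED 1204221 es-production-node12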

Using the preference query parameter, we figured out that the inconsistency means documents are present in the primary but missing from the replica. Moreover, according to our logs and the created field in the documents, it is clear that the missing documents were created before the node was restarted. So the replica somehow rolled back to an earlier state, and the node restart exposed the inconsistency.
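
A minimal sketch of that comparison, using the standard 2.x preference values _primary and _replica and comparing hits.total from an empty search (a specific document ID can be checked the same way with an ids query):

# Total documents as seen by primary shards only
curl -s 'http://localhost:9200/strings/_search?size=0&preference=_primary'

# Total documents as seen by replica shards only
curl -s 'http://localhost:9200/strings/_search?size=0&preference=_replica'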

We also noticed that the appearance of documents missing from the replica but present in the primary correlates with a spike in the number of failed indexing operations (we have both metrics in our monitoring system). At the same time, our application, which indexes data with bulk requests, has nothing in its logs, even though its logic examines every response from Elasticsearch and logs an error whenever it finds a non-empty failed counter in the response. Our assumption is that those failed indexing operations are caused by the translog replay that happens every time a shard is recovered.
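
To make that check explicit, this is roughly what the application looks at in each bulk response (shown here as a shell/jq sketch against a placeholder batch.json payload, not our actual application code; the inner key, index here, depends on the bulk action type):

curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @batch.json > response.json

# Top-level flag: true if any item in the bulk request failed
jq '.errors' response.json

# The individual items that failed, with their error reasons
jq '[.items[].index | select(.error != null)]' response.json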

Our process for restarting a node is as follows:

1. Disable shard allocation in the cluster (see the calls sketched after this list).
2. Restart the Elasticsearch node, one node at a time.
3. Re-enable shard allocation.
4. Wait until shards are recovered and relocated and the cluster status becomes green.

Our traffic pattern includes constant indexing and search traffic being sent to the cluster while nodes are restarted.
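
Steps 1 and 3 correspond to the standard 2.x cluster settings calls, roughly as follows (transient settings are shown here; the exact form we script may differ slightly):

# Step 1: disable shard allocation before restarting a node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'

# ... restart the node and wait for it to rejoin the cluster ...

# Step 3: re-enable shard allocation
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'

# Step 4: wait for the cluster to report green
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'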

Please share your thoughts and ideas on why this inconsistency may happen after a node restart, and on possible ways to avoid it while still being able to maintain the cluster nodes (they need to be restarted from time to time, e.g. to upgrade the OS). Upgrading to Elasticsearch 6.0 is currently not possible for multiple reasons. If you need more information, e.g. logs, please let me know.

Thank you!