We have been running Elasticsearch clusters for a number of years, and they have provided sterling service. These clusters are used for time series data: for each cluster, around 30 new indices are created daily at the witching hour (00:00:00), and additional indices are created throughout the day. Index names are of the form <name>_YYYY.MM.DD (e.g. index_E_2017.10.10). Data is ingested via multiple Logstash instances, mostly one Logstash instance per index.
By way of background, our cluster configuration is as follows:
- Hot/warm architecture (i.e. indexing happens on hot nodes only). Indices are migrated to warm nodes via Curator jobs.
- Replication set to 1, Shard count set to 5.
- Linux host operating system running on bare metal
- Elasticsearch v5.5.1
- OpenJDK 1.8.0_101
- Multiple Logstash instances deliver documents to Elasticsearch via the bulk API
- Four dedicated master-eligible nodes
- 10 (hot) data nodes, each running on dedicated hardware (storage at around 73% used)
- 16 (warm) archive nodes, distributed across four dedicated hosts (storage at around 45% used)
Following a recent upgrade from Elasticsearch v5.2.0 to v5.5.1, inconsistencies were observed between the primary and replica document counts. These inconsistencies have been seen every day since the upgrade, across a number of different newly created indices; older indices are not written to.
Elasticsearch artifacts created within a typical (post v5.5.1 upgrade) day are:
- New indices: 62
- New primary shards: 214
- New replica shards: 214
- New primary documents: 1,442,877,866
- New replica documents: 1,442,877,369
For this typical day, a total of 497 documents, spread across nine indices, were not replicated.
The differences were determined by looking at /_cat/shards and confirmed by searching with preference=_primary and preference=_replica.
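For reference, the comparison was along the following lines (the index name here is just an example):

# per-shard document counts for primaries (p) and replicas (r)
$ curl -s "http://localhost:9200/_cat/shards/index_E_2017.10.10?v&h=index,shard,prirep,docs"
# total hits via the primaries, then via the replicas
$ curl -s "http://localhost:9200/index_E_2017.10.10/_search?size=0&preference=_primary&pretty"
$ curl -s "http://localhost:9200/index_E_2017.10.10/_search?size=0&preference=_replica&pretty"

If primaries and replicas are consistent, the docs column should match for each p/r pair, and hits.total from the two searches should agree.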
Primary and replica document counts for historical indices were also compared; no similar anomalies prior to the v5.5.1 upgrade were observed.
The missing replica documents were mostly timestamped between 00:00:00 and 00:59:59, when the cluster is busy creating indices. It is worth noting that the missing documents are not always a contiguous chunk, e.g. it is not simply the first N documents of the day, although this might be attributed to the order in which the documents are ingested.
In an attempt to understand the issue, the indices were "primed" serially prior to their automated creation at 00:00:00, i.e.
$ curl -XPUT -s ... "http://localhost:9200/index_E_2017.10.10?pretty" -d '{}'
which was run at around 2017-10-09 14:00.
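In full, the priming was just a serial loop over the next day's expected index names, roughly (the index names here are made up):

$ for name in index_A index_B index_E; do curl -XPUT -s "http://localhost:9200/${name}_2017.10.10?pretty" -d '{}'; done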
Priming the indices improved the situation slightly: missing replica documents were seen in only three indices (down from nine).
Our best guess is that a regression introduced between v5.2.0 and v5.5.1 has made index creation more expensive, so replicas take longer to come online, and documents indexed in the meantime end up missing from the replicas.
From reading the documentation on index creation, it appears that, by design, document replication is not synchronous, and that if documents are not replicated there is no remediation process to ensure consistency. Indeed, there appear to be no warnings or errors, other than the response to the bulk API write.
Consistency can be restored by dropping and recreating the replica; however, if the primary and replica shards have exchanged roles due to an external factor (e.g. a node failure), this would mean data loss: the replica to be dropped would contain documents not found in the new primary.
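For completeness, the drop-and-recreate remediation amounts to something like the following (example index name again), leaving the index with no redundancy until the new replicas have been rebuilt from the current primaries:

# drop the replicas for the affected index
$ curl -XPUT -s "http://localhost:9200/index_E_2017.10.10/_settings?pretty" -d '{"index": {"number_of_replicas": 0}}'
# recreate them, copied from whatever the current primaries hold
$ curl -XPUT -s "http://localhost:9200/index_E_2017.10.10/_settings?pretty" -d '{"index": {"number_of_replicas": 1}}'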
The most worrying points of concern are:
- documents are missing from replicas; this should not happen in an otherwise healthy cluster
- there is no warning or error about the failure to write to the replica shards, as far as we can see (other than the bulk write response, which is lost to Logstash)
- no means to reconcile divergent primary and replica shards, other than dropping and recreating the replicas
- queries are now non-deterministic, depending on whether they hit the primary or replica shards
- we must introduce external actions to keep data consistent:
  - indices must be primed (pre-created) before the cluster would otherwise create them, to minimize the impact
  - replica shards must be "reconciled" by dropping and recreating them, running with decreased redundancy while doing so
Elasticsearch has provided a fine service for us until now, and this issue has certainly left a sour taste. Most frustratingly, it only came to our attention by accident; we would otherwise have been blissfully unaware that documents were missing from the replicas, that queries were non-deterministic, and that there was the potential for data loss in the event of replica promotion.
Is index creation expected to be slower in v5.5.1+?
Is replication always just a "best effort" endeavour? Is it something that we should not rely on?
It is self-evident that we have a lot of indices and shards; however, the change in behaviour between v5.2.0 and v5.5.1 remains unexpected. I suspect that we may not be alone in experiencing this or similar issues.