Suddenly, a count operation on one of the indices started returning a few million fewer results than usual.
Looking at the application logs, I see stack traces when trying to add a new document:
File "/myapp/MetaControl/MetaController.py", line 442, in save_av_analysis
self.av_db.index(index="av_index_two", doc_type="document", id=file_id, body={"av": unique_av_scans_vect}, request_timeout=3600)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 319, in index
_make_path(index, doc_type, id), params=params, body=body)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/transport.py", line 318, in perform_request
status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 186, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
2020-05-18T14:15:12+00:00 codexserver.local worker_private_vt[956]: TransportError: TransportError(503, u'unavailable_shards_exception', u'[av_index_two][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[av_index_two][1]] containing [index {[av_index_two][document][REDACTED], source[{"av":["REDACTED2 (score:9)"]}]}]]')
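To rule out the application layer, I could reproduce the write directly against the REST API. This is only a sketch: the document id below is a made-up test id, and depending on how it routes it may or may not land on the broken shard [1], but with a primary missing some portion of writes to av_index_two should time out exactly like this. The short `timeout` just makes the failure show up faster than the default 1m.

```
# Hypothetical test id -- routing decides which shard it hits, so it may take a few ids.
# timeout=5s shortens the wait for an active primary (default 1m).
curl -XPUT 'http://127.0.0.1:9200/av_index_two/document/test-reproduce-0001?timeout=5s&pretty' \
  -H 'Content-Type: application/json' \
  -d '{"av": ["test entry"]}'
```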
Looking at the Elasticsearch logs:
[2020-05-18T14:22:16,914][WARN ][r.suppressed ] path: /av_index_two/document/e1ec61a272e7489dc80741efd5a8c5672bc34dc1, params: {index=av_index_two, id=e1ec61a272e7489dc80741efd5a8c5672bc34dc1, type=document}
org.elasticsearch.action.UnavailableShardsException: [av_index_two][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[av_index_two][1]] containing [index {[av_index_two][document][REDACTED], source[{"av":["REDACTEDstrings"]}]}]]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:862) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:699) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:653) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:816) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1162) [elasticsearch-5.3.2.jar:5.3.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.3.2.jar:5.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
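The exception itself already names the culprit: the primary of shard [av_index_two][1] is not active. A quick way to list every shard together with the reason Elasticsearch recorded for leaving it unassigned (just a suggested column selection for _cat/shards):

```
curl 'http://127.0.0.1:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.at,node'
```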
The cluster allocation explain output:
curl http://127.0.0.1:9200/_cluster/allocation/explain?pretty
{
"index" : "av_index",
"shard" : 2,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "CLUSTER_RECOVERED",
"at" : "2020-02-20T22:47:21.101Z",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "SaAmCeedRJCNPYkwpzQxog",
"node_name" : "SaAmCee",
"transport_address" : "127.0.0.1:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[av_index][2], node[SaAmCeedRJCNPYkwpzQxog], [P], s[STARTED], a[id=hvRyVAD9QfatQKqs6exQLg]]"
}
]
}
]
}
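Without a request body, /_cluster/allocation/explain picks an arbitrary unassigned shard to explain, and here it picked a replica of av_index; its "same_shard" rejection is expected on a single-node cluster. The shard that actually matters is the missing primary [av_index_two][1], which I could ask about explicitly (body fields as documented for 5.x):

```
curl -XGET 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "av_index_two", "shard": 1, "primary": true}'
```

That output should say whether a copy of the shard still exists on disk and which decider, if any, refuses to assign it.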
The Elasticsearch container has been up for 2 months. I haven't restarted it, because it takes around 30 minutes to start and I'm not sure a restart wouldn't make things worse.
`54ea98e8e860 elasticsearch:codex "/docker-entrypoint.?" 8 months ago Up 2 months 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp elasticdb_elasticsearch_1`
Although the node is fairly large, there is still space left on the drive.
# du -h -d 2
774G ./nodes/0
774G ./nodes
774G .
The drive is not full:
# df -h | grep " /$"
/dev/md0 15T 12T 2.0T 86% /
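That said, 86% used may still matter: the default disk watermarks are 85% (low) and 90% (high), so the node is already above the low watermark, and a disk-based decider could plausibly show up in the targeted explain output above. A quick way to see disk usage as Elasticsearch itself reports it per data node:

```
curl 'http://127.0.0.1:9200/_cat/allocation?v'
```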
cluster health:
# curl http://127.0.0.1:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 12,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 45.45454545454545
}
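The numbers add up: there are 11 primaries in total (5 + 5 + 1), 10 of them are active, and the 12 unassigned shards are the 11 replicas plus the one missing primary; 10 of 22 shards active gives the 45.45%. Per-index health should narrow the red status down to av_index_two:

```
curl 'http://127.0.0.1:9200/_cluster/health?level=indices&pretty'
```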
version:
curl http://127.0.0.1:9200
{
"name" : "SaAmCee",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "0CwCUa6BRlifbsQClTH51w",
"version" : {
"number" : "5.3.2",
"build_hash" : "3068195",
"build_date" : "2017-04-24T16:15:59.481Z",
"build_snapshot" : false,
"lucene_version" : "6.4.2"
},
"tagline" : "You Know, for Search"
}
indices:
# curl http://127.0.0.1:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open .kibana pfJfAgBxTBGfzTLwyeS8eQ 1 1 1 0 3.1kb 3.1kb
red open av_index_two VWaMCx4vRkqY0gBFunvaNw 5 1 31213524 710773 617.2gb 617.2gb
yellow open av_index a7Id1EeuQ_-3IMvPG15UQA 5 1 1866655 1850 1.4gb 1.4gb
(There haven't been any config changes in the last few hours, which is when this started happening.)
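One side note on the listing above: every index has rep set to 1, but this is a single-node cluster, so none of those replicas can ever be assigned (that is the "same_shard" decision from earlier). They account for the yellow indices and for 11 of the 12 unassigned shards, but not for the red status. If replicas aren't wanted on a single node, they could be dropped; this is optional cleanup and does nothing for the missing primary:

```
# Optional: stop asking for replicas that a single-node cluster can never allocate.
curl -XPUT 'http://127.0.0.1:9200/_all/_settings?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```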
$ curl http://127.0.0.1:9200/_cat/shards?v
index shard prirep state docs store ip node
av_index 2 p STARTED 372622 292.9mb 127.0.0.1 SaAmCee
av_index 2 r UNASSIGNED
av_index 3 p STARTED 373382 291.3mb 127.0.0.1 SaAmCee
av_index 3 r UNASSIGNED
av_index 1 p STARTED 373211 304mb 127.0.0.1 SaAmCee
av_index 1 r UNASSIGNED
av_index 4 p STARTED 373965 303.3mb 127.0.0.1 SaAmCee
av_index 4 r UNASSIGNED
av_index 0 p STARTED 373475 303.3mb 127.0.0.1 SaAmCee
av_index 0 r UNASSIGNED
.kibana 0 p STARTED 1 3.1kb 127.0.0.1 SaAmCee
.kibana 0 r UNASSIGNED
av_index_two 2 p STARTED 7801315 154gb 127.0.0.1 SaAmCee
av_index_two 2 r UNASSIGNED
av_index_two 3 p STARTED 7806033 154.4gb 127.0.0.1 SaAmCee
av_index_two 3 r UNASSIGNED
av_index_two 4 p STARTED 7804106 154.5gb 127.0.0.1 SaAmCee
av_index_two 4 r UNASSIGNED
av_index_two 1 p UNASSIGNED
av_index_two 1 r UNASSIGNED
av_index_two 0 p STARTED 7804052 154.2gb 127.0.0.1 SaAmCee
av_index_two 0 r UNASSIGNED
java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-1~bpo8+1-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
I have found other topics reporting the same error, but without any replies: