Nodes crash during kNN search, and other kNN questions

Hello!
I have run into something I cannot explain: during kNN search the servers go down. They fail in no particular order and not on every request (for example, two searches succeed and then one fails).

master 4cpu, 4gb
coordinator 8cpu, 8gb (I'm not sure it plays a role; no load is visible on it)
3x data-node 16cpu, 64gb, SSD

Elasticsearch 7.9.1, installed as Open Distro for Elasticsearch 1.10.1 (I use this version because 1.11.0 had a problem with kNN score calculation).

Cluster contents: 22 indexes of about 10 GB each (Elasticsearch itself moved them across the data nodes as needed). Each index has a vector field of 128 elements plus 3 fields with additional data. Each index has 1 shard and 1 segment.

  1. Nodes crash

Initially I set the heap to 30 GB. I then found experimentally that even with a 4 GB heap, with primary shards only, search works properly: the 22 indexes are read from disk in about 30 seconds, and after that a search takes milliseconds.

After I enabled 1 replica for all indexes, search with heap = 4 GB stopped working entirely, and with a heap between 12 GB and 30 GB it works only about every other time. It fails with the following error (this is an example; the node named in it varies):

{'error': {'root_cause': , 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': , 'caused_by': {'type': 'node_not_connected_exception', 'reason': '[data-node-1][] Node not connected'}}, 'status': 500}

I tried reducing circuit_breaker_limit to 50%, but it had no effect.

If I understand correctly, reducing circuit_breaker_limit should have given me a higher search time (for example, the same 15-30 s) while the indexes are rotated in and out of memory. So why do the nodes go down? I set logging.level: DEBUG and there are no errors in the logs, only a message that the affected data node is working again. When I set circuit_breaker_limit=20%, I expected:
(64 GB - 12 GB (JVM heap)) * 20% = 10.4 GB to be held in memory. But in Zabbix I saw memory growth of 40 GB! How is that possible?
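
The arithmetic above can be checked with a short sketch. It only reproduces the formula used in this post, (RAM - JVM heap) * limit. Whether the plugin actually derives its budget this way is an assumption, and the function name is made up for illustration.

```python
def expected_knn_budget_gb(ram_gb, heap_gb, limit_fraction):
    """Memory expected to be held for kNN graphs, per the formula in this post.

    Assumption: the limit applies to the RAM left over after the JVM heap.
    """
    return (ram_gb - heap_gb) * limit_fraction


# Figures from the post: 64 GB data node, 12 GB heap, limit set to 20%.
expected = expected_knn_budget_gb(ram_gb=64, heap_gb=12, limit_fraction=0.20)
observed = 40  # GB, the figure seen in Zabbix

print(f"expected: {expected:.1f} GB")
print(f"observed / expected: {observed / expected:.1f}x")
```

The gap between the two numbers is the thing that needs explaining: either the limit does not cover everything Zabbix counts as memory, or the limit is not being applied as assumed here.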

  2. Replica trouble

When I created 1 replica for the indexes above, I saw in /_opendistro/_knn/stats that the replicas were also being loaded into memory. At the same time, the search speed stayed the same. Is there any way to solve this problem?
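
A minimal sketch of how the per-node figures from /_opendistro/_knn/stats could be summarized to see how much graph memory each data node holds before and after adding replicas. The response layout (a top-level "nodes" object with a "graph_memory_usage" value in kilobytes per node), the base URL, and the sample values are assumptions, not output from this cluster.

```python
import json
from urllib.request import urlopen


def fetch_knn_stats(base_url):
    """Fetch the kNN stats document; base_url is hypothetical, e.g. http://localhost:9200."""
    with urlopen(f"{base_url}/_opendistro/_knn/stats") as resp:
        return json.load(resp)


def graph_memory_by_node_gb(stats):
    """Map node id -> graph memory in GB.

    Assumes each node entry reports "graph_memory_usage" in kilobytes.
    """
    kb_per_gb = 1024 * 1024
    return {
        node_id: node.get("graph_memory_usage", 0) / kb_per_gb
        for node_id, node in stats.get("nodes", {}).items()
    }


# Made-up sample shaped like the assumed response, not real cluster data.
sample_stats = {
    "nodes": {
        "node-a": {"graph_memory_usage": 10 * 1024 * 1024},
        "node-b": {"graph_memory_usage": 20 * 1024 * 1024},
    }
}

per_node = graph_memory_by_node_gb(sample_stats)
for node_id, gb in sorted(per_node.items()):
    print(f"{node_id}: {gb:.1f} GB of graphs in memory")
```

Against a live cluster, `graph_memory_by_node_gb(fetch_knn_stats(base_url))` would give the same summary, provided the field names match what the installed plugin version returns.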

  3. The balancer problem (not the most important, but still)

The balancer tries to load the third data node more than the others.
While data nodes 1 and 2 use 13-14% of their total disk space, data-node-3 uses 18%. No settings were changed, and I do not understand where this comes from.

  4. Vector size (saving SSD space)

Is it true that each element of the vector takes 4 bytes regardless of its data type? It looks as if a vector of 512 float32 elements and a vector of 512 int8 elements do not differ in size. Is it possible to reduce this?
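
To make the question concrete, here is a small sketch of the raw payload size per vector under the two element types. It counts only the vector values themselves, with no graph or index overhead, and it says nothing about how the plugin actually stores them; the helper name is made up for illustration.

```python
FLOAT32_BYTES = 4
INT8_BYTES = 1


def raw_vector_bytes(dimension, bytes_per_element):
    """Bytes needed for the vector values alone, without any index overhead."""
    return dimension * bytes_per_element


as_float32 = raw_vector_bytes(512, FLOAT32_BYTES)
as_int8 = raw_vector_bytes(512, INT8_BYTES)

print(f"512 x float32: {as_float32} bytes per vector")
print(f"512 x int8:    {as_int8} bytes per vector")
print(f"ratio: {as_float32 / as_int8:.0f}x")
```

If both kinds of vector really occupy the same space on disk, that would be consistent with every element being stored as 4 bytes, which is exactly what the question asks to confirm.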

I would be grateful for help with these questions. I'm ready to provide any logs.

P.S. Sorry for my English.

I would recommend you post your question in the OpenDistro forum as it is generally not supported here.

I have already done that and am waiting for an answer. I posted here as well in the hope of getting help.
