Hello everyone!
I've run into something I can't explain: during kNN searches the data nodes go down, in varying order and not every time (e.g. two searches succeed, then one fails, and so on).
Cluster:
master: 4 CPU, 4 GB
coordinator: 8 CPU, 8 GB (I'm not sure it plays a role; no load is visible on it)
3x data nodes: 16 CPU, 64 GB RAM, SSD
circuit_breaker_limit = 85%
Elasticsearch 7.9.1 with Open Distro for Elasticsearch 1.10.1 (I stayed on 1.10.1 because there was a problem with kNN score calculation in 1.11.0).
Cluster contents: 22 indexes of about 10 GB each (Elasticsearch relocates them across the data nodes on its own when necessary). Each index has a vector field of 128 elements plus 3 fields with additional data. Number of segments per index = number of shards per index = 1.
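For reference, each index is created roughly like this (just a sketch in Python via requests; the index and field names are placeholders, not my real ones):

import requests

# Sketch of how one of the 22 indexes is created (index/field names are placeholders).
body = {
    "settings": {
        "index": {
            "knn": True,                 # enable the kNN plugin for this index
            "number_of_shards": 1,
            "number_of_replicas": 0      # replicas were enabled later, see below
        }
    },
    "mappings": {
        "properties": {
            "embedding": {"type": "knn_vector", "dimension": 128},
            "extra_1": {"type": "keyword"},
            "extra_2": {"type": "keyword"},
            "extra_3": {"type": "keyword"}
        }
    }
}
requests.put("http://data-node-1:9200/vectors-000", json=body)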
- Nodes crash
Initially I set the heap to 30 GB, then found experimentally that even with a 4 GB heap the search over primary shards works properly: the 22 indexes are read from disk for about 30 seconds, after which searches take milliseconds.
After I enabled 1 replica for all indexes, search with heap = 4 GB stopped working at all, and with 12 GB < heap < 30 GB it works about every other time. It crashes with the following error (an example; the failing node may change):
{'error': {'root_cause': , 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': , 'caused_by': {'type': 'node_not_connected_exception', 'reason': '[data-node-1][10.250.7.90:9300] Node not connected'}}, 'status': 500}
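The query itself is an ordinary kNN search, roughly like this (a sketch; the index name and query vector are placeholders):

import requests

# Sketch of the search that intermittently fails (placeholder index name and vector).
query = {
    "size": 10,
    "query": {
        "knn": {
            "embedding": {
                "vector": [0.1] * 128,   # the real 128-dimensional query vector goes here
                "k": 10
            }
        }
    }
}
resp = requests.post("http://coordinator:9200/vectors-*/_search", json=query)
print(resp.json())   # on a bad run this contains the node_not_connected_exception above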
I tried reducing circuit_breaker_limit to 50%, but it has no effect...
If I understand correctly, reducing circuit_breaker_limit should just give me a higher search time (for example, the same 15-30 s) while the indexes are swapped in and out of memory? Then why do the nodes crash? I set logging.level: DEBUG and there are no errors in the logs, except for a message that the affected data node is connected again. When I set circuit_breaker_limit = 20%:
(64 GB - 12 GB (JVM heap)) * 20% = 10.4 GB should be held in memory. But in Zabbix I saw memory usage grow by about 40 GB! How is that possible?
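For clarity, this is how I change the limit; I assume knn.memory.circuit_breaker.limit in the cluster settings is the right knob (a sketch):

import requests

# Sketch: lowering the kNN circuit breaker limit via the cluster settings API
# (assuming knn.memory.circuit_breaker.limit is the setting that matters here).
requests.put(
    "http://coordinator:9200/_cluster/settings",
    json={"persistent": {"knn.memory.circuit_breaker.limit": "20%"}},
)

# My expectation for native memory used by the graphs:
ram_gb, heap_gb, limit = 64, 12, 0.20
print((ram_gb - heap_gb) * limit)   # 10.4 GB -- yet Zabbix shows ~40 GB of growth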
- Trouble with replicas
When I created 1 replica for the indexes above, I saw in /_opendistro/_knn/stats that the replica graphs are also loaded into memory, while the search speed stayed the same. Is there any way to avoid this?
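This is how I check it (a sketch; I simply compare graph_memory_usage per node before and after enabling replicas):

import requests

# Sketch: per-node kNN plugin stats; after enabling replicas, graph_memory_usage
# roughly doubles on the data nodes, which is what I'd like to avoid.
stats = requests.get("http://coordinator:9200/_opendistro/_knn/stats").json()
for node_id, node_stats in stats.get("nodes", {}).items():
    print(node_id, node_stats.get("graph_memory_usage"))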
- Balancer problem (not the most important, but still)
The balancer keeps loading the third data node more than the others: data-node-1 and data-node-2 use 13-14% of their total disk space, while data-node-3 uses 18%. No settings were changed, and I don't understand where this comes from.
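The numbers above are taken from the allocation API (sketch):

import requests

# Sketch: this is where the 13-14% vs 18% disk usage per data node comes from.
print(requests.get("http://coordinator:9200/_cat/allocation?v").text)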
- Vector size (saving SSD space)
Is it true that each vector element takes 4 bytes regardless of the data type? It looks like a vector of 512 float32 elements and a vector of 512 int8 elements are the same size on disk. Is it possible to reduce this?
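A quick back-of-the-envelope calculation of what I mean (assuming 4 bytes per element no matter which type is declared):

# Rough per-vector size, assuming every element is stored as 4 bytes regardless of type.
dim = 512
print(dim * 4)   # 2048 bytes per vector if stored as float32
print(dim * 1)   # 512 bytes per vector that I hoped int8 would give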
I would be grateful for any help with these questions. I'm ready to provide any logs.
P.S. Sorry for my English.