Cluster state change after a query

Good afternoon! Please help me figure this out: today the two nodes of my cluster lost each other and the cluster became inoperable. Below is the part of the log from before and during the event:

[2019-02-28T09:43:52,747][DEBUG][o.e.a.s.TransportSearchAction] [elastic-hot] [graylog_608][3], node[euwPHAKwQLOLNfx0fps02g], [P], s[STARTED], a[id=d8Bhiy6fQeWnz0BTErc__g]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[graylog_608, graylog_609], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[message], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=10, batchedReduceSize=512, preFilterShardSize=64, source={
[2019-02-28T09:43:52,755][DEBUG][o.e.a.s.TransportSearchAction] [elastic-hot] [graylog_609][3], node[euwPHAKwQLOLNfx0fps02g], [P], s[STARTED], a[id=_eLHFO5qSeihsSPWdUL8UA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[graylog_608, graylog_609], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[message], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=10, batchedReduceSize=512, preFilterShardSize=64, source={
[2019-02-28T09:43:52,757][DEBUG][o.e.a.s.TransportSearchAction] [elastic-hot] All shards failed for phase: [query]
[2019-02-28T09:43:58,825][INFO ][o.e.m.j.JvmGcMonitorService] [elastic-hot] [gc][58996] overhead, spent [251ms] collecting in the last [1s]
[2019-02-28T09:44:27,127][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T09:44:43,845][INFO ][o.e.m.j.JvmGcMonitorService] [elastic-hot] [gc][59041] overhead, spent [420ms] collecting in the last [1s]
[2019-02-28T09:45:27,130][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T09:45:30,320][INFO ][o.e.c.r.a.AllocationService] [elastic-hot] Cluster health status changed from [GREEN] to [RED] (reason: []).
[2019-02-28T09:45:30,321][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] removed {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-failed({elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)
[2019-02-28T09:45:31,338][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T09:45:31,339][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T09:45:31,347][INFO ][o.e.c.r.DelayedAllocationService] [elastic-hot] scheduling reroute for delayed shards in [58.8s] (64 delayed shards)
[2019-02-28T09:45:53,089][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] added {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-join
[2019-02-28T09:46:23,109][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [82] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T09:46:23,156][WARN ][o.e.d.z.ElectMasterService] [elastic-hot] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])
[2019-02-28T09:46:23,157][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [zen-disco-node-join] took [30s] above the warn threshold of 30s
[2019-02-28T09:46:53,173][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [83] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T09:46:53,253][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [cluster_reroute(async_shard_fetch)] took [30s] above the warn threshold of 30s
[2019-02-28T09:47:46,578][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T09:47:51,214][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [86] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T09:47:51,348][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [shard-started shard id [[graylog_605][2]], allocation id [GpHfybP4T4O-SdUhuRJSwA], primary term [0], message [after existing recovery], shard-started shard id [[graylog_605][0]], allocation id [PkQ1r-llQcW5KbHChhZvRQ], primary term [0], message [after existing recovery], shard-started shard id [[graylog_605][2]], allocation id [GpHfybP4T4O-SdUhuRJSwA], primary term [0], message [master {elastic-hot}{euwPHAKwQLOLNfx0fps02g}{hlIxMvyMTxO1wrpiZG4ymA}{192.168.150.109}{192.168.150.109:9300}{box_type=hot} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started], shard-started shard id [[graylog_605][1]], allocation id [gwhVa9NbQ0eHa9TPpakI1A], primary term [0], message [after existing recovery], shard-started shard id [[graylog_605][1]], allocation id [gwhVa9NbQ0eHa9TPpakI1A], primary term [0], message [master {elastic-hot}{euwPHAKwQLOLNfx0fps02g}{hlIxMvyMTxO1wrpiZG4ymA}{192.168.150.109}{192.168.150.109:9300}{box_type=hot} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started], shard-started shard id [[graylog_605][0]], allocation id [PkQ1r-llQcW5KbHChhZvRQ], primary term [0], message [master {elastic-hot}{euwPHAKwQLOLNfx0fps02g}{hlIxMvyMTxO1wrpiZG4ymA}{192.168.150.109}{192.168.150.109:9300}{box_type=hot} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]] took [30.1s] above the warn threshold of 30s

Most likely a memory shortage. What was happening with the heap at the moment of the failure?
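As an aside, the ElectMasterService warning in your log also deserves attention: with two master-eligible nodes the quorum is 2, not 1. On 5.x this setting is dynamic, so it can be raised without a restart. A minimal sketch (assumes you run it on one of the nodes, default port 9200):

```shell
# Raise minimum_master_nodes to a quorum of master-eligible nodes (2 of 2 here).
# Caveat: with only two master-eligible nodes the cluster loses its master
# whenever either node is down; three master-eligible nodes is the usual minimum.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'
```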

The server has 24 GB of RAM; I didn't look at the heap at the moment of the failure. Is there some way to find it now from the old log?

In the log I found only this:

sysadmin@elastic-hot:~$ grep "heap" /var/log/elasticsearch/graylog-2019-02-28.log 
[2019-02-28T10:29:35,576][INFO ][o.e.e.NodeEnvironment    ] [elastic-hot] heap size [11.8gb], compressed ordinary object pointers [true]
[2019-02-28T10:52:46,794][INFO ][o.e.e.NodeEnvironment    ] [elastic-hot] heap size [11.8gb], compressed ordinary object pointers [true]

What comes in the log right before these lines? They usually appear when the node restarts. The cause of a restart can often be found right before the node-startup messages.

And in general, how can I see the state of the heap at the moment of a failure?

Here is the log before the restart; the restart was done manually so that the hot node would find the warm node:

[2019-02-28T10:21:54,627][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:22:53,481][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] removed {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-failed({elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)
[2019-02-28T10:22:54,353][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:22:54,353][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:22:54,355][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:22:54,356][INFO ][o.e.c.r.DelayedAllocationService] [elastic-hot] scheduling reroute for delayed shards in [59s] (61 delayed shards)
[2019-02-28T10:23:30,131][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] added {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-join
[2019-02-28T10:24:00,152][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [132] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T10:24:00,198][WARN ][o.e.d.z.ElectMasterService] [elastic-hot] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])
[2019-02-28T10:24:00,198][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [zen-disco-node-join] took [30s] above the warn threshold of 30s
[2019-02-28T10:24:30,202][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [133] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T10:24:30,238][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [delayed_allocation_reroute] took [30s] above the warn threshold of 30s
[2019-02-28T10:26:10,314][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:26:14,086][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [139] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T10:26:14,536][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [shard-started shard id [[graylog_603][1]], allocation id [pSgqhJC_Qk25OJABSt2faw], primary term [0], message [master {elastic-hot}{euwPHAKwQLOLNfx0fps02g}{hlIxMvyMTxO1wrpiZG4ymA}{192.168.150.109}{192.168.150.109:9300}{box_type=hot} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started], shard-started shard id [[graylog_603][1]], allocation id [pSgqhJC_Qk25OJABSt2faw], primary term [0], message [after existing recovery]] took [30.4s] above the warn threshold of 30s
[2019-02-28T10:27:10,320][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:27:14,388][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] removed {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-failed({elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)
[2019-02-28T10:27:14,586][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:27:14,586][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:27:14,586][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elastic-hot] failed to execute [indices:monitor/stats] on node [vMZI77LoTyuHlEH-xVvJJQ]
[2019-02-28T10:27:14,593][INFO ][o.e.c.r.DelayedAllocationService] [elastic-hot] scheduling reroute for delayed shards in [59.7s] (13 delayed shards)
[2019-02-28T10:27:42,300][INFO ][o.e.c.s.ClusterService   ] [elastic-hot] added {{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm},}, reason: zen-disco-node-join
[2019-02-28T10:28:05,445][INFO ][o.e.n.Node               ] [elastic-hot] stopping ...
[2019-02-28T10:28:12,322][WARN ][o.e.d.z.PublishClusterStateAction] [elastic-hot] timed out waiting for all nodes to process published state [142] (timeout [30s], pending nodes: [{elastic-warm}{vMZI77LoTyuHlEH-xVvJJQ}{hIKLM1muRAOjZEbfWjNYaA}{192.168.150.108}{192.168.150.108:9300}{box_type=warm}])
[2019-02-28T10:28:12,354][WARN ][o.e.d.z.ElectMasterService] [elastic-hot] value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss! Please set it to at least a quorum of master-eligible nodes (current value: [1], total number of master-eligible nodes used for publishing in this round: [2])
[2019-02-28T10:28:12,354][WARN ][o.e.c.s.ClusterService   ] [elastic-hot] cluster state update task [zen-disco-node-join] took [30s] above the warn threshold of 30s
[2019-02-28T10:28:12,378][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elastic-hot] failed to execute on node [vMZI77LoTyuHlEH-xVvJJQ]

I noticed that after startup these lines appear; what do they mean?

[2019-02-28T10:31:52,569][INFO ][o.e.m.j.JvmGcMonitorService] [elastic-hot] [gc][122] overhead, spent [807ms] collecting in the last [1.7s]
[2019-02-28T10:36:15,737][WARN ][o.e.m.j.JvmGcMonitorService] [elastic-hot] [gc][385] overhead, spent [584ms] collecting in the last [1s]
[2019-02-28T10:43:03,977][INFO ][o.e.m.j.JvmGcMonitorService] [elastic-hot] [gc][793] overhead, spent [358ms] collecting in the last [1s]

These lines mean that your node is spending roughly half of its time in GC, which in turn means the node is overloaded in some way (it could be a memory shortage, a busy CPU, or something else). You need to look at the stats.
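For example, heap usage and GC counters can be pulled from the nodes stats API at any time; polling it periodically is also a simple way to see the heap state leading up to a failure. A sketch (port 9200 and local access are assumptions about your setup):

```shell
# Full JVM stats per node: heap used/max, GC collection counts and times.
curl -s "localhost:9200/_nodes/stats/jvm?pretty"

# A compact one-line-per-node view of heap pressure:
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"
```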

I saved the output here: https://pastebin.com/FDzXTg8b

I don't see anything suspicious except the heavy swap usage. Were there problems on the nodes at the moment you collected these stats?

There weren't. I added a few things to the configuration; as I understand it, next time this will help collect more information.

sysadmin@elastic-hot:~$ curl -X GET "localhost:9200/_cluster/settings"
{"persistent":{"logger":{"org":{"elasticsearch":{"discovery":{"zen":{"MasterFaultDetection":"TRACE","NodesFaultDetection":"TRACE"}}}}}},"transient":{"cluster":{"routing":{"allocation":{"disk":{"watermark":{"low":"90%"}}}},"info":{"update":{"interval":"1m"}}}}}
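For the record, these were applied through the cluster settings API, roughly like this (a sketch reproducing the fault-detection TRACE loggers shown in the output above):

```shell
# Enable TRACE logging for zen discovery fault detection on all nodes,
# persisted across restarts; no elasticsearch.yml edit or restart needed.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "logger.org.elasticsearch.discovery.zen.MasterFaultDetection": "TRACE",
    "logger.org.elasticsearch.discovery.zen.NodesFaultDetection": "TRACE"
  }
}'
```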

It would be great if you added one more small node as a separate cluster and shipped the monitoring data from the main cluster to it. Then it would be immediately clear what was happening right before a node dropped out.

Bring up another virtual machine with Elasticsearch that is not part of this cluster? Sorry, I don't quite understand what I need to do. I have four more nodes prepared for this cluster, two warm and two hot, but I haven't put them into service yet, since that requires a full cluster restart.

I meant setting up cluster monitoring, so that you can understand what is happening to the cluster before it goes down.

OK, thanks, I'll go read up on it.

One clarification: how do I "ship the monitoring data" once I bring up Kibana? We had one, by the way; it was running on the main cluster.

Take a look at https://www.elastic.co/guide/en/elastic-stack-overview/6.6/monitoring-production.html and https://www.elastic.co/guide/en/elasticsearch/reference/6.6/collecting-monitoring-data.html

Can the monitoring cluster be a single node? That is, I need one more virtual machine, bring up Elasticsearch there, and send the monitoring data to it through Kibana, right? My Elasticsearch is version 5.6.11; I'm planning an upgrade, but later. Can this whole monitoring scheme be used on 5.6.11?

Yes, one node is more than enough.

No, Kibana is what you will use to view the results. The data will be sent by the monitoring exporter, which runs inside Elasticsearch itself.

It works even on 2.x; this scheme has existed for a long time. Only the documentation is formatted a bit differently: https://www.elastic.co/guide/en/x-pack/5.6/monitoring-cluster.html
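On 5.6 the exporter is configured in elasticsearch.yml on the production (monitored) nodes. A minimal sketch, assuming X-Pack is installed and the dedicated monitoring node is reachable at `monitoring-host:9200` (that hostname and the exporter id are placeholders):

```yaml
# elasticsearch.yml on each node of the production cluster:
xpack.monitoring.exporters:
  my_monitoring_cluster:          # arbitrary exporter id
    type: http
    host: ["http://monitoring-host:9200"]
```

After a restart of the production nodes, the monitoring indices appear on the monitoring cluster, and Kibana is pointed at that cluster to view them.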

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.