Error: Request Timeout after 30000ms - excessive number of shards

Josmell_Chavarri · February 6, 2018, 3:43pm

Hello

I am running Wazuh 2.0 (https://documentation.wazuh.com/2.0/getting-started/index.html) in an agent-server setup. I have 31 agents installed, and the server runs ELK to handle the logs the agents send.

For the past few weeks I have been getting the following error in Kibana:

Error: Request Timeout after 30000ms
ErrorAbstract@http://xxxxxxxx:5601/bundles/kibana.bundle.js?v=14849:12:24939
StatusCodeError@http://xxxxxxxx5601/bundles/kibana.bundle.js?v=14849:12:28395
Transport.prototype.request/requestTimeoutId<@http://xxxxxxxxx:5601/bundles/kibana.bundle.js?v=14849:13:4431
Transport.prototype._timeout/id<@http://xxxxxxxxx:5601/bundles/kibana.bundle.js?v=14849:13:4852

And also this error:

Error: in cell #1: [illegal_argument_exception] Trying to query 1401 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.
at throwWithCell (/usr/share/kibana/src/core_plugins/timelion/server/handlers/chain_runner.js:30:11)
at /usr/share/kibana/src/core_plugins/timelion/server/handlers/chain_runner.js:160:13
at arrayEach (/usr/share/kibana/node_modules/lodash/index.js:1289:13)
at Function. (/usr/share/kibana/node_modules/lodash/index.js:3345:13)
at /usr/share/kibana/src/core_plugins/timelion/server/handlers/chain_runner.js:152:9
at bound (domain.js:280:14)
at runBound (domain.js:293:12)
at tryCatcher (/usr/share/kibana/node_modules/bluebird/js/main/util.js:26:23)
at Promise._settlePromiseFromHandler (/usr/share/kibana/node_modules/bluebird/js/main/promise.js:503:31)
at Promise._settlePromiseAt (/usr/share/kibana/node_modules/bluebird/js/main/promise.js:577:18)
at Promise._settlePromises (/usr/share/kibana/node_modules/bluebird/js/main/promise.js:693:14)
at Async._drainQueue (/usr/share/kibana/node_modules/bluebird/js/main/async.js:123:16)
at Async._drainQueues (/usr/share/kibana/node_modules/bluebird/js/main/async.js:133:10)
at Immediate.Async.drainQueues (/usr/share/kibana/node_modules/bluebird/js/main/async.js:15:14)
at runCallback (timers.js:666:20)
at tryOnImmediate (timers.js:639:5)

*********Some server details:

[root@xxxxxxx ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cl-root 97G 40G 58G 42% /
devtmpfs 3,9G 0 3,9G 0% /dev
tmpfs 3,9G 0 3,9G 0% /dev/shm
tmpfs 3,9G 17M 3,8G 1% /run
tmpfs 3,9G 0 3,9G 0% /sys/fs/cgroup
/dev/sda1 1014M 230M 785M 23% /boot
tmpfs 782M 0 782M 0% /run/user/0

[root@xxxxxxxx ~]# free
total used free shared buff/cache available
Mem: 8002828 2283280 132632 16928 5586916 5312056
Swap: 2097148 0 2097148

************Versions used

CentOS 7
Elasticsearch 5.5.0
Logstash 5.5.0
Kibana 5.5.0

******Elasticsearch:

GET Health Status ---------------------------

[root@xxxxxxx ~]# curl 'localhost:9200/_cat/health?v'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1517923628 07:27:08 wazuh yellow 1 1 1411 1411 0 0 10 0 - 99.3%

GET Nodes -----------------------------------

[root@xxxxxxx ~]# curl 'localhost:9200/_cat/nodes?v'
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1 94 98 23 0.55 0.85 0.92 mdi * node-1

GET cluster health --------------------------

[root@xxxxxxx~]# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "wazuh",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 1411,
"active_shards" : 1411,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 10,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 99.29627023223082

GET cluster health indices --------------------------

[root@xxxxxx ~]# curl -XGET 'http://localhost:9200/_cluster/health?level=indices?pretty=true'

{
"cluster_name":"wazuh",
"status":"yellow",
"timed_out":false,
"number_of_nodes":1,
"number_of_data_nodes":1,
"active_primary_shards":1411,
"active_shards":1411,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":10,
"delayed_unassigned_shards":0,
"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,
"task_max_waiting_in_queue_millis":0,
"active_shards_percent_as_number":9

GET cluster health shards --------------------------

[root@xxxxxxxx ~]# curl -XGET 'http://localhost:9200/_cluster/health?level=shards?pretty=true'

{
"cluster_name":"wazuh",
"status":"yellow",
"timed_out":false,
"number_of_nodes":1,
"number_of_data_nodes":1,
"active_primary_shards":1411,
"active_shards":1411,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":10,
"delayed_unassigned_shards":0,
"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,
"task_max_waiting_in_queue_millis":0,
"active_shards_percent_as_number":99.2962702322

From these results I can see that I have a very large number of shards for a single node.
How could I reduce the number of shards, to see whether that fixes the problem?

Or, failing that, should I create another node on the same server to improve performance?

I would appreciate any help you can give me.

Regards

Ugo_Sangiorgi · February 6, 2018, 5:04pm

Hello Josmell,
You can control the number of primary shards and replicas through the settings API on your indices:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html

The defaults are 5 primary shards and 1 replica per shard (10 shards per index in total), which may be too many for your case.
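To put those defaults next to the figures posted earlier in the thread, here is a rough back-of-the-envelope sketch in Python. It assumes that nearly every index was created with the default 5 primaries and takes the 1411 active primaries and the 1000-shard query limit from the output and error quoted in the first post. The function names are illustrative only.

```python
import math

SHARD_QUERY_LIMIT = 1000  # limit quoted in the Kibana/Timelion error
ACTIVE_PRIMARIES = 1411   # active_primary_shards from _cluster/health


def estimate_index_count(primary_shards, primaries_per_index=5):
    """Rough number of indices that would account for the given primaries."""
    return math.ceil(primary_shards / primaries_per_index)


def total_primaries(index_count, primaries_per_index):
    """Primary shards for the same indices at a different per-index count."""
    return index_count * primaries_per_index


indices = estimate_index_count(ACTIVE_PRIMARIES)
one_each = total_primaries(indices, 1)

print("estimated indices:", indices)
print("primaries at 1 per index:", one_each)
print("under the query limit:", one_each <= SHARD_QUERY_LIMIT)
```

Under these assumptions the same data held in single-shard indices would sit well below the limit that the error reports.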

The number of primary shards cannot be changed after an index has been created, so you will need to reindex if you want to reduce it.
The number of replicas can be changed dynamically (after the index has been created) without any problem.
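As an illustration of the dynamic replica change, the sketch below builds (but does not send) an update-settings request with Python's standard library. The host `http://localhost:9200` matches the curl calls in the first post; the index pattern `indice-*` and the function name are assumptions made for the example, so adjust them to your own indices.

```python
import json
import urllib.request


def build_replica_update(base_url, index_pattern, replicas):
    """Build a PUT <index>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical pattern; on a single-node cluster replicas can never be assigned.
req = build_replica_update("http://localhost:9200", "indice-*", 0)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually apply it: urllib.request.urlopen(req)
```

Only the request is constructed here, so the snippet can be inspected safely before anything is changed on the cluster.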

If your indices are created automatically (for example, daily indices created from Logstash), it is important to use templates to manage the number of shards: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html

Josmell_Chavarri · February 19, 2018, 6:39pm

Hello Ugo

Thank you very much for the reply. Daily indices are indeed being generated, and I am using a template.

The first thing I did was check which shards were unassigned for some reason. Here is an excerpt of the output:

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

indice-2017.10.31 4 p STARTED
indice-2017.10.31 3 p STARTED
indice-2017.10.31 1 p STARTED
indice-2017.10.31 2 p STARTED
indice-2017.10.31 0 p STARTED
indice-2018.02.07 3 p STARTED
indice-2018.02.07 3 r UNASSIGNED CLUSTER_RECOVERED
indice-2018.02.07 4 p STARTED
indice-2018.02.07 4 r UNASSIGNED CLUSTER_RECOVERED
indice-2018.02.07 1 p STARTED
indice-2018.02.07 1 r UNASSIGNED CLUSTER_RECOVERED
indice-2018.02.07 2 p STARTED
indice-2018.02.07 2 r UNASSIGNED CLUSTER_RECOVERED
indice-2018.02.07 0 p STARTED
indice-2018.02.07 0 r UNASSIGNED CLUSTER_RECOVERED

This confirmed what you said: I had the defaults of 5 shards with 1 replica per shard.

I then modified the template in use, adding the following options to change the number of shards and replicas being created.
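The same check can be scripted when the listing is long. The following is a minimal sketch in Python that tallies primaries, replicas, and unassigned shards per index from `_cat/shards` text. The sample is a shortened excerpt of the output above, and the function name is made up for the example.

```python
from collections import defaultdict

# Shortened excerpt of `_cat/shards?h=index,shard,prirep,state,unassigned.reason`.
CAT_SHARDS = """\
indice-2017.10.31 4 p STARTED
indice-2017.10.31 3 p STARTED
indice-2018.02.07 3 p STARTED
indice-2018.02.07 3 r UNASSIGNED CLUSTER_RECOVERED
indice-2018.02.07 4 p STARTED
indice-2018.02.07 4 r UNASSIGNED CLUSTER_RECOVERED
"""


def tally_shards(cat_output):
    """Return {index: {"p": primaries, "r": replicas, "unassigned": count}}."""
    tally = defaultdict(lambda: {"p": 0, "r": 0, "unassigned": 0})
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        index, _shard, prirep, state = fields[:4]
        tally[index][prirep] += 1
        if state == "UNASSIGNED":
            tally[index]["unassigned"] += 1
    return dict(tally)


shard_tally = tally_shards(CAT_SHARDS)
for index, counts in sorted(shard_tally.items()):
    print(index, counts)
```

On a one-node cluster every replica shows up as unassigned, because a replica cannot be placed on the same node as its primary.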

{
"order" : 0,
"template" : "template",
"settings" : {
"index.refresh_interval" : "5s",
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"mappings" : {
"..." : "..."
}
}

After loading the template again I got these results:

1.- The replicas were removed.
2.- The new indices are now created with 1 shard and no replicas.
3.- I still get the "timeout after 30000 ms" error, but once that time has passed the search results are displayed.

indice-2018.02.08 0 p STARTED --------> new index
indice-2018.02.07 1 p STARTED
indice-2018.02.07 3 p STARTED
indice-2018.02.07 4 p STARTED
indice-2018.02.07 2 p STARTED
indice-2018.02.07 0 p STARTED

Then I ran the reindex and the error went away.
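The post does not say how the reindex was carried out. One common approach, shown here purely as an assumption, is to copy a month of old 5-shard daily indices into a single destination index through the `_reindex` API. The sketch below only builds the request body in Python. The source pattern `indice-2017.10.*` and the destination name `indice-2017.10` are hypothetical, and the destination only picks up the 1-shard setting if the template pattern matches its name.

```python
import json


def build_reindex_body(source_pattern, dest_index):
    """Build the JSON body for POST /_reindex from a source pattern to one index."""
    return {
        "source": {"index": source_pattern},
        "dest": {"index": dest_index},
    }


# Hypothetical names: consolidate the October 2017 daily indices into one index.
reindex_body = build_reindex_body("indice-2017.10.*", "indice-2017.10")

print("POST /_reindex")
print(json.dumps(reindex_body, indent=2))
```

Reindexing copies documents and leaves the source indices in place, so the shard count only drops once the old daily indices have been verified and deleted.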