ELK Server Failed

Hi,

My ELK server failed 2 days ago with a 'Too many open files' error in Elasticsearch. I managed to resolve this by moving out old indexes.
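In case it recurs, the node stats API reports the current and maximum file descriptor counts, and the limit can be raised for the Elasticsearch user. A minimal sketch, assuming a default install on localhost:9200 running as the 'elasticsearch' user:

curl -s 'localhost:9200/_nodes/stats/process?pretty'
# compare process.open_file_descriptors against process.max_file_descriptors

# /etc/security/limits.conf (assumes Elasticsearch runs as the 'elasticsearch' user)
elasticsearch - nofile 65536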

Now I can access Kibana, but I'm not seeing any new indexes being created, nor any data being received.

I have received the pipeline blocked error in my Logstash log files, so following information I found I increased my congestion_threshold to 400 (a large number, to stop the circuit breaker tripping).
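For context, that setting sits on the beats input in older versions of the logstash-input-beats plugin; a minimal sketch (the port here is an assumption):

input {
  beats {
    port => 5044   # assumed; whatever port your shippers point at
    # seconds a blocked pipeline is tolerated before the circuit breaker trips
    congestion_threshold => 400
  }
}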

I'm not seeing anything being indexed, and I receive message=>"retrying failed action with response code: 503", :level=>:warn} in the logs.
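For reference, the checks I know of for a 503 on bulk retries are cluster health and the bulk thread pool stats (default host and port assumed; column names vary a little by version):

curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'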

Any suggestions?

Thanks

What is the error in the elasticsearch logs?

None - the last entries are just from when I restarted the server a minute or two ago:

[2018-08-06 15:14:31,619][INFO ][node ] [nms1] started
[2018-08-06 15:14:47,665][INFO ][gateway ] [nms1] recovered [413] indices into cluster_state

How much heap do you have assigned? How many indices and shards are there in the cluster?
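For reference, something like this will pull those numbers (default host and port assumed):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'
curl -s 'localhost:9200/_cat/indices' | wc -l
curl -s 'localhost:9200/_cluster/health?pretty'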

cluster_name":"nms",
"status":"red",
"timed_out":false,
"number_of_nodes":1,
"number_of_data_nodes":1,
"active_primary_shards":2043,
"active_shards":2043,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2077,
"delayed_unassigned_shards":0,
"number_of_pending_tasks":0,
"number_of_in_flight_fetch":0,
"task_max_waiting_in_queue_millis":0,
"active_shards_percent_as_number":49.5873786407767}

heap size [19.9gb]

Some detail above...

So... I have noticed my cluster is red and it has loaded only 49% of my shards... Could this be due to moving my old indexes out on the filesystem instead of removing them with XDELETE?

I have put one back and restarted, and can see the active primary shard count has increased.
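The shard listing shows which indices the missing shards belong to and, on recent versions, why they are unassigned (defaults assumed):

curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED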

Would this stop the data getting in?

I think you have far too many shards for a cluster that size. Please read this blog post on shards and sharding.
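In the meantime, deleting unwanted indices through the API, rather than moving their directories out, keeps the cluster state consistent. A sketch, assuming daily logstash-* indices (the pattern is an assumption):

# deletes all January 2018 daily indices; wildcard deletes work unless
# action.destructive_requires_name is enabled
curl -XDELETE 'localhost:9200/logstash-2018.01.*'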

I plan to do this after I get the server working again. I have had this working with more shards, so I should be able to get it back to a workable state first.

Where else can I go with this?

OK, so I have got a little further... I shut down ELK, moved out some indexes from around the time the system crashed, and restarted. I got a burst of data into the system, then my OSSEC servers received a connection refused error.

I have worked out that I can get the data through with a series of service restarts that flush the data through the pipe: the Logstash services followed by Filebeat.
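Roughly this, assuming systemd units named logstash and filebeat:

sudo systemctl restart logstash
sudo systemctl restart filebeat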

This is not ideal, but hopefully once the data catches up Logstash will be able to handle it in real time.
