ELK data not getting saved


(Rahul Kumar) #1

Hi awesome people

We are using ELK with Amazon SQS on an in-house server.

Elasticsearch log

[2018-05-15T11:39:03,411][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][982956][151886] duration [8.8s], collections [1]/[9.8s], total [8.8s]/[4.8d], memory [3.9gb]->[3.7gb]/[3.9gb], all_pools {[young] [399.4mb]->[288.8mb]/[399.4mb]}{[survivor] [41.6mb]->[0b]/[49.8mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2018-05-15T11:39:03,411][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982956] overhead, spent [8.8s] collecting in the last [9.8s]
[2018-05-15T11:39:13,309][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][982958][151887] duration [8s], collections [1]/[8.8s], total [8s]/[4.8d], memory [3.9gb]->[3.8gb]/[3.9gb], all_pools {[young] [399.4mb]->[296.9mb]/[399.4mb]}{[survivor] [1.8mb]->[0b]/[49.8mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2018-05-15T11:39:13,309][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982958] overhead, spent [8s] collecting in the last [8.8s]
[2018-05-15T11:39:20,377][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982965] overhead, spent [432ms] collecting in the last [1s]
[2018-05-15T11:39:30,031][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982966] overhead, spent [9s] collecting in the last [9.6s]
[2018-05-15T11:39:39,355][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982975] overhead, spent [442ms] collecting in the last [1s]
[2018-05-15T11:39:47,357][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982983] overhead, spent [401ms] collecting in the last [1s]
[2018-05-15T11:39:56,612][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982984] overhead, spent [9s] collecting in the last [9.2s]
[2018-05-15T11:40:13,156][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][982992][151893] duration [9.2s], collections [1]/[9.3s], total [9.2s]/[4.8d], memory [3.9gb]->[3.7gb]/[3.9gb], all_pools {[young] [399.4mb]->[287mb]/[399.4mb]}{[survivor] [48.3mb]->[0b]/[49.8mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
[2018-05-15T11:40:13,156][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982992] overhead, spent [9.2s] collecting in the last [9.3s]
[2018-05-15T11:40:22,385][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][983001] overhead, spent [432ms] collecting in the last [1s]
[2018-05-15T11:40:30,386][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][983009] overhead, spent [442ms] collecting in the last [1s]

Logstash log

[2018-05-15T09:22:03,420][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@254974e8 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7de20ac8[Running, pool size = 6, active threads = 6, queued tasks = 50, completed tasks = 60068]]"})
[2018-05-15T09:22:03,420][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2018-05-15T09:22:03,420][ERROR][logstash.outputs.elasticsearch] Action
[2018-05-15T09:22:03,420][ERROR][logstash.outputs.elasticsearch] Action

After reading the logs and searching the forum, my understanding is that the memory allocated to the various ELK components is almost full.
The only solution I can think of is to free some space, but I am not sure what to clear or how.
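The 429 in the Logstash log is worth reading closely: its "reason" describes the state of the bulk thread pool at the moment the request was refused. The minimal sketch below pulls those numbers out of the message with regular expressions, assuming the exact wording shown in the log above (other Elasticsearch versions may phrase it differently). Here the queue (capacity 50) already held 50 tasks and all 6 threads were busy, which points at a node that cannot keep up with indexing rather than at a full disk.

```python
import re

# The "reason" from the 429 response in the Logstash log above.
REASON = (
    "rejected execution of org.elasticsearch.transport.TransportService$7@254974e8 "
    "on EsThreadPoolExecutor[bulk, queue capacity = 50, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7de20ac8"
    "[Running, pool size = 6, active threads = 6, queued tasks = 50, "
    "completed tasks = 60068]]"
)

# Output key -> label as it appears in the message (assumed wording).
FIELDS = {
    "queue_capacity": "queue capacity",
    "pool_size": "pool size",
    "active_threads": "active threads",
    "queued_tasks": "queued tasks",
}


def parse_rejection(reason: str) -> dict:
    """Extract the thread-pool name and counters from an es_rejected_execution_exception."""
    state = {"pool": re.search(r"EsThreadPoolExecutor\[(\w+),", reason).group(1)}
    for key, label in FIELDS.items():
        state[key] = int(re.search(rf"{label} = (\d+)", reason).group(1))
    return state


state = parse_rejection(REASON)
print("pool:", state["pool"])
print("queue full:", state["queued_tasks"] >= state["queue_capacity"])
print("all threads busy:", state["active_threads"] >= state["pool_size"])
```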

Note: We don't need last year's data; it can be deleted everywhere.

Need help, thanks.


(Jymit Singh Khondhu) #2

@Rahul_Kumar2, what is the output of _cluster/health?pretty? Are you over-sharded? My current presumption is that your cluster is under memory pressure from having too many indices for its current resources. That would cause bulk rejections (see _cat/thread_pool?v), hence the 429 pushback to Logstash.
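To check for bulk rejections as suggested, one can fetch _cat/thread_pool?v with curl and scan the text for non-zero rejected counts. The sketch below does that for the default column set (node_name, name, active, queue, rejected), which is an assumption; if columns are selected explicitly with the h parameter, the header lookup still works as long as those three columns are present. Note that the rejected column is a cumulative counter since the node started, so comparing two runs a few minutes apart is more telling than a single snapshot. The sample rows are taken from the output posted later in this thread.

```python
# Columns as returned by the default _cat/thread_pool?v output.
SAMPLE = """\
node_name name                active queue rejected
node-1    bulk                     1     0      454
node-1    flush                    3     1        0
node-1    management               4     0        0
node-1    search                   0     0        0
"""


def rejected_pools(cat_output: str) -> dict[tuple[str, str], int]:
    """Map (node, pool) to its rejected count, keeping only pools with rejections."""
    header, *rows = cat_output.strip().splitlines()
    columns = header.split()
    node_col = columns.index("node_name")
    name_col = columns.index("name")
    rejected_col = columns.index("rejected")
    result = {}
    for row in rows:
        cells = row.split()
        rejected = int(cells[rejected_col])
        if rejected > 0:
            result[(cells[node_col], cells[name_col])] = rejected
    return result


print(rejected_pools(SAMPLE))  # {('node-1', 'bulk'): 454}
```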

Note that the log excerpt above shows old-generation garbage collections taking just under 10 seconds each and freeing only about 0.2GB of the Elasticsearch heap (3.9GB -> 3.7GB, out of a 3.9GB maximum).
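The same reading can be automated. The sketch below scans JvmGcMonitorService lines and reports, for each old-generation collection, how long it took, how much heap it freed and how full the heap still is, plus the share of time spent in GC from the "overhead" lines. It assumes the log format shown in this thread (the all_pools section is omitted from the sample because the regexes do not need it); other Elasticsearch versions may word these lines differently.

```python
import re

LOG = """\
[2018-05-15T11:39:03,411][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][old][982956][151886] duration [8.8s], collections [1]/[9.8s], total [8.8s]/[4.8d], memory [3.9gb]->[3.7gb]/[3.9gb]
[2018-05-15T11:39:03,411][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982956] overhead, spent [8.8s] collecting in the last [9.8s]
[2018-05-15T11:39:20,377][INFO ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][982965] overhead, spent [432ms] collecting in the last [1s]
"""

OLD_GC = re.compile(
    r"\[gc\]\[old\].*?duration \[(\S+?)\].*?memory \[(\S+?)\]->\[(\S+?)\]/\[(\S+?)\]"
)
OVERHEAD = re.compile(r"overhead, spent \[(\S+?)\] collecting in the last \[(\S+?)\]")

TIME_UNITS = {"ms": 0.001, "s": 1.0}
SIZE_UNITS = {"b": 1 / 1024**3, "kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}


def to_seconds(value: str) -> float:
    number, unit = re.fullmatch(r"([\d.]+)(ms|s)", value).groups()
    return float(number) * TIME_UNITS[unit]


def to_gb(value: str) -> float:
    number, unit = re.fullmatch(r"([\d.]+)(b|kb|mb|gb)", value).groups()
    return float(number) * SIZE_UNITS[unit]


def analyse(log: str) -> list[str]:
    """Summarise old-GC effectiveness and GC overhead, one entry per matching line."""
    report = []
    for line in log.splitlines():
        old = OLD_GC.search(line)
        if old:
            duration, before, after, limit = old.groups()
            freed = to_gb(before) - to_gb(after)
            report.append(
                f"old GC took {to_seconds(duration):.1f}s, freed {freed:.1f}gb, "
                f"heap still {to_gb(after) / to_gb(limit):.0%} full"
            )
        over = OVERHEAD.search(line)
        if over:
            spent, window = (to_seconds(v) for v in over.groups())
            report.append(f"GC overhead {spent / window:.0%} of the last {window:g}s")
    return report


for entry in analyse(LOG):
    print(entry)
```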


(Rahul Kumar) #3

@JKhondhu Thanks for the reply.
I think I need to clear data (indices and logs). Should I delete it, and will that help?
I am not sure, so could you guide me on this?

Meanwhile, here are the results you asked for:
curl -XGET "localhost:9999/_cluster/health"

{
  "cluster_name": "xxxx-api-logs",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 3577,
  "active_shards": 3577,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 3971,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 144,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 67954,
  "active_shards_percent_as_number": 47.364936440677965
}

curl -XGET "localhost:9999/_cat/thread_pool?v"

node_name name                active queue rejected
node-1    bulk                     1     0      454
node-1    fetch_shard_started      0     0        0
node-1    fetch_shard_store        0     0        0
node-1    flush                    3     1        0
node-1    force_merge              0     0        0
node-1    generic                  1     0        0
node-1    get                      0     0        0
node-1    index                    0     0        0
node-1    listener                 0     0        0
node-1    management               4     0        0
node-1    refresh                  0     0        0
node-1    search                   0     0        0
node-1    snapshot                 0     0        0
node-1    warmer                   0     0        0


(David Pilato) #4

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
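A quick way to quantify "too many shards per node" from the health output posted above is to compare active shards per data node against a shards-per-heap budget. The sketch below uses the commonly cited rule of thumb from Elastic's shard-sizing guidance of roughly 20 shards per GB of heap; treat that figure as an assumption and a guideline, not a hard limit. Only active shards are counted, since unassigned shards are not currently held by the node. The 4GB heap is the figure from this thread, and 8GB is shown for comparison.

```python
import json

# Subset of the _cluster/health output posted in this thread.
HEALTH = """{
  "status": "red",
  "number_of_data_nodes": 1,
  "active_shards": 3577,
  "initializing_shards": 4,
  "unassigned_shards": 3971
}"""

# Rule of thumb (assumption, not a hard limit): ~20 shards per GB of heap per data node.
SHARDS_PER_GB_HEAP = 20


def shard_pressure(health: dict, heap_gb: float) -> dict:
    """Compare active shards per data node with the rule-of-thumb budget."""
    per_node = health["active_shards"] / health["number_of_data_nodes"]
    budget = SHARDS_PER_GB_HEAP * heap_gb
    return {
        "shards_per_node": per_node,
        "budget_per_node": budget,
        "times_over_budget": per_node / budget,
        "unassigned": health["unassigned_shards"],
    }


health = json.loads(HEALTH)
for heap_gb in (4, 8):
    p = shard_pressure(health, heap_gb)
    print(
        f"{heap_gb}GB heap: {p['shards_per_node']:.0f} shards vs budget "
        f"{p['budget_per_node']:.0f} ({p['times_over_budget']:.0f}x over)"
    )
```

Even with a doubled heap, the node would stay an order of magnitude over this budget, which is why reducing the shard count matters more than resizing the heap.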


(Rahul Kumar) #5

@JKhondhu I implemented Filebeat and everything appears to be working now, but I don't know how long that will last. If you have any idea, let me know.

Thanks


(Jymit Singh Khondhu) #6

Hi,

Implementing Filebeat as opposed to Logstash will not solve your problem; you will see the same thing happen.

You have too much data (too many indices) residing in Elasticsearch. You need to either (a) delete data older than the retention period you need, or (b) grow your cluster.

Close to 4,000 shards on a single node with a 4GB heap is crazy and asking for trouble. Please go through all the links @dadoonet shared, and put in place Curator (or your own job) to delete indices older than X days. That frees resources so your cluster is able to ingest the new data coming in via Logstash.
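The core of such a retention job is deciding which indices fall outside the window. The minimal sketch below assumes Logstash's default daily index naming (logstash-YYYY.MM.DD); the index names in the example are hypothetical, and in practice they would come from _cat/indices. Anything that does not match the pattern, such as .kibana, is left alone. Each selected name could then be removed with the delete index API (DELETE /<index>); Curator's delete_indices action with an age filter expresses the same rule declaratively.

```python
import re
from datetime import date, datetime, timedelta

# Assumes Logstash's default daily index naming, e.g. logstash-2018.05.15.
DAILY_INDEX = re.compile(r"^logstash-(\d{4}\.\d{2}\.\d{2})$")


def expired_indices(names: list[str], retention_days: int, today: date) -> list[str]:
    """Return the daily indices whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        match = DAILY_INDEX.match(name)
        if match is None:
            continue  # not a dated daily index (e.g. .kibana): never touch it
        day = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


# Hypothetical index names for illustration only.
names = ["logstash-2017.05.14", "logstash-2018.05.01", "logstash-2018.05.15", ".kibana"]
for index in expired_indices(names, retention_days=30, today=date(2018, 5, 15)):
    # Candidate for deletion, e.g. via DELETE /<index>
    print(index)
```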


(Rahul Kumar) #7

@JKhondhu @dadoonet Thanks for the advice. I am looking into the shared resources while also learning the ELK stack. For now, I have increased the heap size to 8GB and everything is working. I will also set up Curator.
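When raising the heap, Elasticsearch reads it from the -Xms and -Xmx lines in config/jvm.options. The sketch below encodes the usual guidance as assumptions: set both to the same value, give the heap at most half of the machine's RAM (the rest is left to the OS filesystem cache), and stay below the roughly 32GB compressed-oops threshold (capped at 30GB here to be conservative). The 16GB server in the example is hypothetical; the thread does not say how much RAM the machine has.

```python
def heap_options(ram_gb: int, wanted_gb: int) -> list[str]:
    """Return matching -Xms/-Xmx lines for config/jvm.options.

    Assumptions (general Elasticsearch guidance, check the docs for your version):
    * min and max heap are set to the same value;
    * the heap gets at most half of the machine's RAM;
    * the heap stays below the ~32GB compressed-oops threshold (30GB cap here).
    """
    ceiling = min(ram_gb // 2, 30)
    heap = min(wanted_gb, ceiling)
    if heap < 1:
        raise ValueError(f"{ram_gb}GB of RAM is too little for a 1GB heap")
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


# Hypothetical 16GB server, aiming for an 8GB heap.
print("\n".join(heap_options(ram_gb=16, wanted_gb=8)))
```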

Regards

