Indexing stops at a low number of records due to GC overhead despite proper heap settings


(Viktor) #1

Hello,

My Elasticsearch stops indexing after a certain number of logs are sent to it via Logstash, and I keep seeing:

[2018-03-08T15:44:34,677][INFO ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14522] overhead, spent [460ms] collecting in the last [1s]
2018-03-08T15:44:35.562-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: read tcp 10.244.9.182:30789->10.244.9.182:5044: i/o timeout
2018-03-08T15:44:35.562-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: read tcp 10.244.9.182:30789->10.244.9.182:5044: i/o timeout
2018-03-08T15:44:35.643-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: client is not connected
[2018-03-08T15:44:35,677][WARN ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14523] overhead, spent [563ms] collecting in the last [1s]
2018-03-08T15:44:36.643-0800	ERROR	pipeline/output.go:92	Failed to publish events: client is not connected
[2018-03-08T15:44:36,678][INFO ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14524] overhead, spent [411ms] collecting in the last [1s]
2018-03-08T15:44:37.443-0800	INFO	[monitoring]	log/log.go:124	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":980,"time":986},"total":{"ticks":13470,"time":13480,"value":13470},"user":{"ticks":12490,"time":12494}},"info":{"ephemeral_id":"56497923-61d1-4f43-8caa-896a9f2f8fab","uptime":{"ms":1500016}},"memstats":{"gc_next":17106224,"memory_alloc":14692256,"memory_total":736280280}},"filebeat":{"harvester":{"open_files":99,"running":98}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":9,"failed":18432,"total":18432},"read":{"errors":3},"write":{"bytes":911900}},"pipeline":{"clients":1,"events":{"active":4117,"retry":24576}}},"registrar":{"states":{"current":7}},"system":{"load":{"1":4.91,"15":4.75,"5":5.09,"norm":{"1":2.455,"15":2.375,"5":2.545}}}}}}
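
In case it helps, a node-stats call like the following shows the heap pressure and GC counts behind those overhead messages (default host and port assumed here):

curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'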

Below is the result of jps -l -m -v:

-bash-4.1$ jps -l -m -v
21862 org.logstash.Logstash -f app-store-pipeline.conf -Xms10g -Xmx10g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.compile.invokedynamic=true -Djruby.jit.threshold=0 -XX:+HeapDumpOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom
7608 org.elasticsearch.bootstrap.Elasticsearch -Xms10g -Xmx10g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.18wEpBH6 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.path.home=/scratch/elk/elasticsearch-6.2.2 -Des.path.conf=/scratch/elk/elasticsearch-6.2.2/config
23258 sun.tools.jps.Jps -l -m -v -Dapplication.home=/ds1/home/oracle/jdk/jdk1.8.0_102 -Xms8m

The total count stops at 4096:

{
  "count" : 4096,
  "_shards" : {
    "total" : 440,
    "successful" : 440,
    "skipped" : 0,
    "failed" : 0
  }
}
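
(For reference, that count comes from the _count API, i.e. roughly the following, with the default port assumed:)

curl -XGET 'http://localhost:9200/logstash*/_count?pretty'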

It has been an hour since my last log was indexed into Elasticsearch. Could someone please help? Thanks!


(Christian Dahlqvist) #2

Is there anything in the Elasticsearch logs?


(Viktor) #3

Hi Christian,

Thanks for replying! Most of them are GC overhead logs; I am appending some error logs below:

[2018-03-08T17:03:14,013][ERROR][o.e.x.w.e.ExecutionService] [aTs5egi] could not store triggered watch with id [EdLkIMLuSfqd-gJ-lD74Bw_elasticsearch_cluster_status_4095cfc4-a054-49a4-bdf1-dd1797a73d0a-2018-03-09T01:03:14.009Z]: [RemoteTransportException[[aTs5egi][10.244.9.182:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$7@10505431 on EsThreadPoolExecutor[name = aTs5egi/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1c5f289b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 203316]]];]
[2018-03-08T17:03:13,878][ERROR][o.e.a.b.TransportBulkAction] [aTs5egi] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@1ee239e2 on EsThreadPoolExecutor[name = aTs5egi/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1c5f289b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 203316]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.2.2.jar:6.2.2]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_102]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:75) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:496) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:135) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:86) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:103) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:188) ~[?:?]
	at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.maybeRun(AuthorizationUtils.java:183) ~[?:?]

[2018-03-08T17:22:50,931][ERROR][i.n.u.c.D.rejectedExecution] Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113) ~[?:?]
	at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1010) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:825) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1036) ~[?:?]
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:305) ~[?:?]
	at org.elasticsearch.http.netty4.Netty4HttpChannel.sendResponse(Netty4HttpChannel.java:146) ~[?:?]
	at org.elasticsearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:491) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
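
In case it helps, the bulk queue rejections can also be seen via the _cat thread pool API (run against the same node, default port assumed):

curl -XGET 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected'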


(Christian Dahlqvist) #4

How many indices and shards are you actively indexing into? How many indices and shards do you have in the cluster?
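
For example, the standard _cat APIs would show this:

curl -XGET 'http://localhost:9200/_cat/indices?v'
curl -XGET 'http://localhost:9200/_cat/shards?v'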


(Viktor) #5

Currently 88 indices and 400 shards.


(Christian Dahlqvist) #6

How large is your Elasticsearch heap?
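
(For example, via the _cat nodes API:)

curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'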


(Viktor) #7

10 GB is allocated for the heap (-Xms10g -Xmx10g, as shown in the jps output above).


(Viktor) #8

I tried deleting indices, flushing, and force merging (roughly the calls shown after the output below). But every time I restart the ELK stack, Elasticsearch stops indexing after 4096 documents. GET /logstash*/_count?pretty gives me the following result, and the GC overhead messages go on forever:

{
  "count" : 4096,
  "_shards" : {
    "total" : 415,
    "successful" : 415,
    "skipped" : 0,
    "failed" : 0
  }
}
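
The delete, flush, and force merge were roughly these calls (the index pattern here is just an example):

curl -XDELETE 'http://localhost:9200/logstash-*'
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_forcemerge'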

(Viktor) #9

There are also some errors in the Logstash log, like the ones below:
[2018-03-09T14:06:45,436][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "12044":String
[2018-03-09T14:06:45,438][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "317":String
[2018-03-09T14:06:45,440][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "585531":String
[2018-03-09T14:06:45,442][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "301":String
[2018-03-09T14:06:45,444][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "158755":String
[2018-03-09T14:06:45,447][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "221286":String

Are they related to the issue?
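
For reference, the ruby filter doing that division is roughly like the sketch below; the field names and units here are placeholders from memory, but converting the string field to a number before dividing avoids the exception:

ruby {
    code => "
        v = event.get('[duration]')
        # the field arrives as a String, so convert it before doing arithmetic
        event.set('[duration_s]', v.to_f / 1000.0) unless v.nil?
    "
}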


(Viktor) #10

Hi Christian,

I found that once I removed the fingerprint block from the Logstash conf file, indexing seems to be fine.

fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
}
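
For context, the fingerprint was meant to feed document_id in the elasticsearch output for de-duplication, roughly like this (the hosts and index below are placeholders):

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "logstash-%{+YYYY.MM.dd}"
        document_id => "%{[@metadata][fingerprint]}"
    }
}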

Do you have any suggestions for de-duplication that avoid this kind of performance issue?

Thanks!


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.