Indexing stops at a low number of records due to GC overhead despite proper heap settings

Hello,

My Elasticsearch stops indexing after a certain number of logs are sent to it via Logstash, and I keep seeing:

[2018-03-08T15:44:34,677][INFO ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14522] overhead, spent [460ms] collecting in the last [1s]
2018-03-08T15:44:35.562-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: read tcp 10.244.9.182:30789->10.244.9.182:5044: i/o timeout
2018-03-08T15:44:35.562-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: read tcp 10.244.9.182:30789->10.244.9.182:5044: i/o timeout
2018-03-08T15:44:35.643-0800	ERROR	logstash/async.go:235	Failed to publish events caused by: client is not connected
[2018-03-08T15:44:35,677][WARN ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14523] overhead, spent [563ms] collecting in the last [1s]
2018-03-08T15:44:36.643-0800	ERROR	pipeline/output.go:92	Failed to publish events: client is not connected
[2018-03-08T15:44:36,678][INFO ][o.e.m.j.JvmGcMonitorService] [aTs5egi] [gc][14524] overhead, spent [411ms] collecting in the last [1s]
2018-03-08T15:44:37.443-0800	INFO	[monitoring]	log/log.go:124	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":980,"time":986},"total":{"ticks":13470,"time":13480,"value":13470},"user":{"ticks":12490,"time":12494}},"info":{"ephemeral_id":"56497923-61d1-4f43-8caa-896a9f2f8fab","uptime":{"ms":1500016}},"memstats":{"gc_next":17106224,"memory_alloc":14692256,"memory_total":736280280}},"filebeat":{"harvester":{"open_files":99,"running":98}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":9,"failed":18432,"total":18432},"read":{"errors":3},"write":{"bytes":911900}},"pipeline":{"clients":1,"events":{"active":4117,"retry":24576}}},"registrar":{"states":{"current":7}},"system":{"load":{"1":4.91,"15":4.75,"5":5.09,"norm":{"1":2.455,"15":2.375,"5":2.545}}}}}}

Below is the result of jps -l -m -v:

-bash-4.1$ jps -l -m -v
21862 org.logstash.Logstash -f app-store-pipeline.conf -Xms10g -Xmx10g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.compile.invokedynamic=true -Djruby.jit.threshold=0 -XX:+HeapDumpOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom
7608 org.elasticsearch.bootstrap.Elasticsearch -Xms10g -Xmx10g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.18wEpBH6 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.path.home=/scratch/elk/elasticsearch-6.2.2 -Des.path.conf=/scratch/elk/elasticsearch-6.2.2/config
23258 sun.tools.jps.Jps -l -m -v -Dapplication.home=/ds1/home/oracle/jdk/jdk1.8.0_102 -Xms8m
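
(For anyone reproducing this setup: both heaps shown above come from the respective config/jvm.options files rather than from the command line; a minimal sketch of the relevant lines, assuming the default 6.x directory layout and that the 10g values reported by jps are what you intend:)

# elasticsearch-6.2.2/config/jvm.options -- keep Xms equal to Xmx
-Xms10g
-Xmx10g

# logstash/config/jvm.options -- sized separately from Elasticsearch; 10g here is just what jps reported
-Xms10g
-Xmx10g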

The total count stops at 4096:

{
  "count" : 4096,
  "_shards" : {
    "total" : 440,
    "successful" : 440,
    "skipped" : 0,
    "failed" : 0
  }
}
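
(That count came from the _count API; a quick way to keep re-checking it while indexing, assuming Elasticsearch on the default port and a logstash-* index pattern:)

# Re-check the document count across the logstash indices
curl -s 'localhost:9200/logstash*/_count?pretty'

# Watch per-index document counts and sizes at the same time
curl -s 'localhost:9200/_cat/indices/logstash*?v&h=index,docs.count,store.size'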

It has been an hour since my last log was indexed into Elasticsearch. Could someone please help? Thanks!

Is there anything in the Elasticsearch logs?

Hi Christian,

Thanks for replying! Most of them are GC overhead logs. I am appending some error logs below:

[2018-03-08T17:03:14,013][ERROR][o.e.x.w.e.ExecutionService] [aTs5egi] could not store triggered watch with id [EdLkIMLuSfqd-gJ-lD74Bw_elasticsearch_cluster_status_4095cfc4-a054-49a4-bdf1-dd1797a73d0a-2018-03-09T01:03:14.009Z]: [RemoteTransportException[[aTs5egi][10.244.9.182:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$7@10505431 on EsThreadPoolExecutor[name = aTs5egi/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1c5f289b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 203316]]];]
[2018-03-08T17:03:13,878][ERROR][o.e.a.b.TransportBulkAction] [aTs5egi] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@1ee239e2 on EsThreadPoolExecutor[name = aTs5egi/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1c5f289b[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 203316]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.2.2.jar:6.2.2]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_102]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:75) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:496) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:135) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:86) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:103) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:188) ~[?:?]
	at org.elasticsearch.xpack.security.authz.AuthorizationUtils$AsyncAuthorizer.maybeRun(AuthorizationUtils.java:183) ~[?:?]

[2018-03-08T17:22:50,931][ERROR][i.n.u.c.D.rejectedExecution] Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:113) ~[?:?]
	at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:87) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1010) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:825) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1036) ~[?:?]
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:305) ~[?:?]
	at org.elasticsearch.http.netty4.Netty4HttpChannel.sendResponse(Netty4HttpChannel.java:146) ~[?:?]
	at org.elasticsearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:491) ~[elasticsearch-6.2.2.jar:6.2.2]
	at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
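
(The EsRejectedExecutionException entries above mean the bulk thread pool queue, capacity 200 on this node, is full while GC keeps stalling it; a quick hedged check of how much is queueing and being rejected, assuming the default port:)

# Watch bulk queue depth and rejections on the node (ES 6.x _cat API)
curl -s 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected,completed'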

How many indices and shards are you actively indexing into? How many indices and shards do you have in the cluster?

currently 88 indices and 400 shards

How large is your Elasticsearch heap?

10G allocated for heap
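
(A hedged way to confirm how much of that 10 GB is actually in use while the GC warnings fire, again assuming the default port:)

# Live heap usage and CPU on each node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.percent,cpu'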

I tried deleting indices, flushing memory, and force merging, but every time I restart the ELK stack, Elasticsearch stops indexing after 4096 documents. /logstash*/_count?pretty gives me the following result, and the GC overhead messages go on forever:

{
  "count" : 4096,
  "_shards" : {
    "total" : 415,
    "successful" : 415,
    "skipped" : 0,
    "failed" : 0
  }
}

There are also some errors in the Logstash log, like the ones below:
[2018-03-09T14:06:45,436][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "12044":String
[2018-03-09T14:06:45,438][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "317":String
[2018-03-09T14:06:45,440][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "585531":String
[2018-03-09T14:06:45,442][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "301":String
[2018-03-09T14:06:45,444][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "158755":String
[2018-03-09T14:06:45,447][ERROR][logstash.filters.ruby ] Ruby exception occurred: undefined method `/' for "221286":String
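
(This kind of Ruby exception usually means a ruby filter is dividing an event field that is still a String; a minimal hedged sketch of the usual fix, where the field name "duration_ms" and the divisor are purely illustrative:)

ruby {
    code => '
      # Convert the field to a number before doing arithmetic on it;
      # "duration_ms" and 1000.0 are placeholders for illustration.
      v = event.get("duration_ms")
      event.set("duration_s", v.to_f / 1000.0) unless v.nil?
    '
}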

Are they related to the issue?

Hi Christian,

I found that once I removed the fingerprint block from the Logstash conf file, indexing seems to be fine.

fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
}
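
(For reference, the usual way such a fingerprint is used for de-duplication is as the document _id in the elasticsearch output, so a re-sent event overwrites its earlier copy instead of creating a duplicate; a hedged sketch in which the hosts and index pattern are placeholders:)

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "logstash-%{+YYYY.MM.dd}"
        document_id => "%{[@metadata][fingerprint]}"
    }
}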

Do you have any suggestions on de-duplication and avoiding performance issues?

Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.