Apm-agent-java resets connection to server in couple of seconds

Hi!

I have APM Server instance, it works fine, getting metrics from some java and python agents, showing perfect graphs.

Now I'm trying to add agent to another tomcat application and here is problem.
Agent successfully sends several events to APM Server after tomcat start, but later it drops connections to intake/v2/events in the middle of http request, so server can not get events.

Here is access logs from http balancer (nginx) in front of apm server.
10.99.1.9 - agent's machine, rt is request time, ut - upstream response time, ua - upstream address, ul - upstream response length
First events request goes well, then almost all requests fails:

10.99.1.9 - - [22/Jun/2020:19:15:34 +0300] "POST /config/v1/agents HTTP/1.1" 200 59 "-" "elasticapm-java/1.17.0" "-" rt=0.039 ua="127.0.0.1:8200" us="200" ut="0.036" ul="59" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:44 +0300] "POST /intake/v2/events HTTP/1.1" 202 0 "-" "elasticapm-java/1.17.0" "-" rt=10.223 ua="127.0.0.1:8200" us="202" ut="0.216" ul="0" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:45 +0300] "POST /intake/v2/events HTTP/1.1" 400 0 "-" "elasticapm-java/1.17.0" "-" rt=0.476 ua="-" us="-" ut="-" ul="-" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:46 +0300] "POST /intake/v2/events HTTP/1.1" 400 0 "-" "elasticapm-java/1.17.0" "-" rt=0.883 ua="-" us="-" ut="-" ul="-" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:50 +0300] "POST /intake/v2/events HTTP/1.1" 400 0 "-" "elasticapm-java/1.17.0" "-" rt=1.556 ua="-" us="-" ut="-" ul="-" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:52 +0300] "POST /intake/v2/events HTTP/1.1" 400 0 "-" "elasticapm-java/1.17.0" "-" rt=1.423 ua="-" us="-" ut="-" ul="-" cs=-
10.99.1.9 - - [22/Jun/2020:19:15:55 +0300] "POST /intake/v2/events HTTP/1.1" 400 0 "-" "elasticapm-java/1.17.0" "-" rt=1.761 ua="-" us="-" ut="-" ul="-" cs=-

Here is cause of error 400 in balancer log:

2020/06/22 19:15:52 [info] 34546#34546: *760 recv() failed (104: Connection reset by peer), client: 10.99.1.9, server: sv-apm01, request: "POST /intake/v2/events HTTP/1.1", host: "eapm-api.lavr.me"
2020/06/22 19:15:55 [info] 34546#34546: *761 recv() failed (104: Connection reset by peer), client: 10.99.1.9, server: sv-apm01, request: "POST /intake/v2/events HTTP/1.1", host: "eapm-api.lavr.me"

It is not timeout issue.
As balancer log shows, agent successfully posts first events in 10 seconds (actually in 0.216sec), but later it breaks connection after 0.476 - 1.7 seconds.

I've checked https://www.elastic.co/guide/en/apm//server/current/common-problems.html#io-timeout - tried to increase all timeouts, no success.

It's not request size problem, described in https://www.elastic.co/guide/en/apm//server/current/common-problems.html#event-too-large
As balancer log shows, it receives about 7Kb from agent before agent resets connection.

Tried send events without balancer, directly to port 8200 - same error in agent logs, no events in apm server. So, it is not balancer issue. And I'm quite sure it is not apm server issue.

In agent log:

2020-06-22 18:01:10,173 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Error writing request body to server, response code is -1
2020-06-22 18:01:10,173 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null
2020-06-22 18:01:10,216 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type SPAN with this error: Unable to write to target stream.
2020-06-22 18:01:10,217 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.shaded.lmax.disruptor.IgnoreExceptionHandler - Exception processing: 4375 co.elastic.apm.agent.report.ReportingEvent@7fefd1e0
co.elastic.apm.agent.shaded.dslplatform.json.SerializationException: Unable to write to target stream.
	at co.elastic.apm.agent.shaded.dslplatform.json.JsonWriter.flush(JsonWriter.java:581) ~[?:?]
	at co.elastic.apm.agent.report.serialize.DslJsonSerializer.flush(DslJsonSerializer.java:139) ~[?:?]
	at co.elastic.apm.agent.report.AbstractIntakeApiHandler.endRequest(AbstractIntakeApiHandler.java:122) ~[?:?]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.endRequest(IntakeV2ReportingEventHandler.java:162) ~[?:?]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.handleEvent(IntakeV2ReportingEventHandler.java:101) ~[?:?]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:73) ~[?:?]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:44) ~[?:?]
	at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168) [?:?]
	at co.elastic.apm.agent.shaded.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125) [?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: java.io.IOException: Error writing request body to server
	at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3536) ~[?:1.8.0_151]
	at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3519) ~[?:1.8.0_151]
	at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) ~[?:1.8.0_151]
	at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) ~[?:1.8.0_151]
	at co.elastic.apm.agent.shaded.dslplatform.json.JsonWriter.flush(JsonWriter.java:579) ~[?:?]
	... 9 more

Tomcat startup config:

/opt/jre1.8.0_151/bin/java ... -javaagent:/usr/share/tomcat8/lib/elastic-apm-agent-1.17.0.jar -Delastic.apm.config_file=/srv/sp/conf/elasticapm.properties ... start

Content of /srv/sp/conf/elasticapm.properties:

log_level=INFO
environment=production
server_timeout=5s
api_request_time=10s
api_request_size=1mb
metrics_interval=5s
server_urls=http://sv-apm01.lavr.me/
service_name=sp
log_file=/srv/sp/logs/elasticapm.log

Versions:
Apache Tomcat/8.0.32
Ubuntu 16.04
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)

What can it be?
Appreciate any help.

Welcome to the forum :wave:

What's the difference between the Java application where the agent runs as expected and the application that gets the errors? Are you using a different agent version? Are the applications running in different hosts and/or network configurations?

It seems that there's something between the agent and the server that closes the connection or doesn't properly support chunked-encoded HTTP requests. Using traceroute, you could try to find out what's the path the requests take on their way from the agent to the server. Do you have access to the logs of the network devices that make up those hops?
Could you try to deploy an agent from the same host the server is running on and see if it works? If it does, you can try to find out what the problematic hop is.

BTW, the Python agent currently doesn't stream events via a chunked-encoded HTTP requests. It rather buffers into memory and sends batched requests.

Issue was caused by network problems, more specific by IPS (Intrusion Prevention System) from our security guys.

Here is policy from Checkpoint IPS than blocked java agent connection:
2020-06-23 15.00.35

@felixbarny thanks for you help

1 Like

Thanks for sharing!