GO Agent: apmhttp middleware

// ServeHTTP delegates to h.Handler, tracing the transaction with
// h.Tracer, or apm.DefaultTracer if h.Tracer is nil.
func (h *handler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	if !h.tracer.Active() || h.requestIgnorer(req) {
		h.handler.ServeHTTP(w, req)
		return
	}
	tx, req := StartTransaction(h.tracer, h.requestName(req), req)
	defer tx.End()

	body := h.tracer.CaptureHTTPRequestBody(req)
	w, resp := WrapResponseWriter(w)
	defer func() {
		if v := recover(); v != nil {
			if resp.StatusCode == 0 {
				w.WriteHeader(http.StatusInternalServerError)
			}
			h.recovery(w, req, resp, body, tx, v)
		}
		SetTransactionContext(tx, req, resp, body)
		body.Discard()
	}()
	h.handler.ServeHTTP(w, req)
	if resp.StatusCode == 0 {
		resp.StatusCode = http.StatusOK
	}
}

For me seems not really good idea for recovery here - breaking of single responsibility principle.
I think it should throw panic to a higher level after logging.

// ServeHTTP delegates to h.Handler, tracing the transaction with
// h.Tracer, or apm.DefaultTracer if h.Tracer is nil.
func (h *handler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	if !h.tracer.Active() || h.requestIgnorer(req) {
		h.handler.ServeHTTP(w, req)
		return
	}
	tx, req := StartTransaction(h.tracer, h.requestName(req), req)
	defer tx.End()

	body := h.tracer.CaptureHTTPRequestBody(req)
	w, resp := WrapResponseWriter(w)
	defer func() {
                v := recover()
		if v != nil {
			if resp.StatusCode == 0 {
				w.WriteHeader(http.StatusInternalServerError)
			}
		}
		SetTransactionContext(tx, req, resp, body)
		body.Discard()

               panic(v)
	}()
	h.handler.ServeHTTP(w, req)
	if resp.StatusCode == 0 {
		resp.StatusCode = http.StatusOK
	}
}

@lobotomist thanks for sharing your thoughts here.

I tend to agree, it probably wasn't the best decision to suppress panics in the handlers by default. There is an issue about this already: https://github.com/elastic/apm-agent-go/issues/540. I think we can consider changing the behaviour in the next major version of the agent.

In the mean time, I would be happy to review a PR to add an option to resume panicking at the end of the defer, like you illustrate here.

It would be really cool. Because I have my middleware for recovery and I have to dance around 'apmhttp.RecoveryFunc' function to apply my middleware there. :+1:

Also, I have one question.

if v := recover(); v != nil {
			if resp.StatusCode == 0 {
				w.WriteHeader(http.StatusInternalServerError)
			}
			h.recovery(w, req, resp, body, tx, v)
		}

So my opinion:

  1. APM agent must track all panics.
  2. Checks the flag for panic forwarding.

I'll prepare a pull request for review.

I found two way:

  1. Add WithoutRecovery option which completely disables recovery logic.
  2. Add WithPanicPropagation which throws the panic after h.recovery call.

Second way looks more convenient for me because it will keep the logic of error recording in Elastic APM.
But in this case, it will set 500 response code and makes it immutable for next middlewares.

@axw what do you think?

Both are potentially useful, but #2 sounds best for your case.

Right. The expectation has been that the tracing middleware is the outer-most middleware. which is why it's doing panic recovery in the first place. So these are the options I see:

  1. Disable recovery: this means errors won't be reported to Elastic APM. You could report them separately, in your higher level middleware, but they will lose their transaction context.
  2. Set the status code to 500 (making the response headers immutable), report error to Elastic APM, propagate panic.
  3. Assume a status code of 500 without writing it to the response, report error to Elastic APM, propagate panic.

I think we should go with #3, but this only makes sense when propagating the panic, as otherwise the status code won't be set and will default to 200. So I think it should look something like this:

  • If not propagating panic, do as we do today: recover, set the status code to 500 if not already set, and report error.
  • If propagating panic: recover, report error, assume status code is 500 if not already set, propagate panic. If the outer middleware recovers the panic and changes the status code, the status code reported by APM will be inconsistent, but there's not much we can do about that.