Fix in libbeat to split bulk requests that are too large not working in Elastic Agent 8.9.0?

In Beats, I see that this commit was merged in March 2023:

" Split large batches on error instead of dropping them"
PR 34911

I think that PR 34911 changed the Publish() logic:

If I look here:

func (client *Client) Publish(ctx context.Context, batch publisher.Batch) error {
	events := batch.Events()
	rest, err := client.publishEvents(ctx, events)

	switch {
	case errors.Is(err, errPayloadTooLarge):
		if batch.SplitRetry() {
			// Report that we split a batch
			client.observer.Split()
		} else {
			// If the batch could not be split, there is no option left but
			// to drop it and log the error state.
			batch.Drop()
			client.observer.Dropped(len(events))
			err := apm.CaptureError(ctx, fmt.Errorf("failed to perform bulk index operation: %w", err))
			err.Send()
			client.log.Error(err)
		}
		// Returning an error from Publish forces a client close / reconnect,
		// so don't pass this error through since it doesn't indicate anything
		// wrong with the connection.
		return nil
	case len(rest) == 0:
		batch.ACK()
	default:
		batch.RetryEvents(rest)
	}
	return err
}

So if my understanding of the logic is correct, this function should try to split up a bulk request that is too large, and only fail if it cannot successfully split the request up and send it.
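
For illustration, here is a minimal, self-contained sketch (not the actual Beats SplitRetry implementation) of that halving behavior: a batch rejected as too large is split in two and each half is retried, and an event is only given up on when it cannot be split any further.

package main

import (
	"errors"
	"fmt"
)

// errPayloadTooLarge stands in for the sentinel error that the Elasticsearch
// output maps an HTTP 413 response to.
var errPayloadTooLarge = errors.New("the bulk payload is too large for the server")

// sendWithSplit retries a too-large batch by recursively halving it, and
// gives up only when a single event is still too large to send.
func sendWithSplit(events []string, send func([]string) error) error {
	err := send(events)
	if !errors.Is(err, errPayloadTooLarge) {
		return err // success, or an unrelated error
	}
	if len(events) <= 1 {
		return fmt.Errorf("dropping event, cannot split further: %w", err)
	}
	mid := len(events) / 2
	if err := sendWithSplit(events[:mid], send); err != nil {
		return err
	}
	return sendWithSplit(events[mid:], send)
}

func main() {
	// Pretend the server rejects any request carrying more than two events.
	send := func(batch []string) error {
		if len(batch) > 2 {
			return errPayloadTooLarge
		}
		fmt.Println("sent:", batch)
		return nil
	}
	fmt.Println(sendWithSplit([]string{"a", "b", "c", "d", "e"}, send))
}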

I am running elastic-agent 8.9.0, and I do not seem to see this behavior. I am getting hundreds of errors like the one I posted here:

A few questions, maybe @faec @Lee_Hinman @cmacknz can help:

  1. Does elastic-agent 8.9.0 have the fix in PR 34911?
  2. Is this fix working properly?
  3. How do I tell how large the payload is that libbeat is trying to send to Elasticsearch via a POST _bulk API call?
  4. Does libbeat ignore the setting of http.max_content_length on the server?

Any help you can give to shed light on this and solve my problem would be greatly appreciated!

  1. Does elastic-agent 8.9.0 have the fix in PR 34911 ?

Yes, it does. You can see the tags that include it on the merge commit: Split large batches on error instead of dropping them (#34911) · elastic/beats@df59745 · GitHub

Is this fix working properly?

It doesn't appear to be having any effect here, and from what you've posted I suspect the reason might be that Elastic Agent isn't actually seeing the HTTP 413 response and is instead seeing "write tcp [redacted]:33430->[redacted]:443: write: broken pipe" as the error. The new code won't do anything unless it explicitly receives a 413 from Elasticsearch.
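
To illustrate the distinction, here is a minimal, self-contained sketch (not the actual libbeat source) of the idea: only an explicit 413 status maps to the errPayloadTooLarge sentinel that Publish splits on, while a transport-level failure such as a broken pipe surfaces as an ordinary error and never reaches the split path.

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel standing in for the error the Elasticsearch output produces when
// the server answers a bulk request with 413 Request Entity Too Large.
var errPayloadTooLarge = errors.New("the bulk payload is too large for the server")

// classify sketches the idea: only a 413 status becomes errPayloadTooLarge;
// a write error (e.g. "write: broken pipe") is returned as-is, so the
// split-on-413 logic in Publish never sees it.
func classify(status int, transportErr error) error {
	if transportErr != nil {
		return transportErr
	}
	if status == http.StatusRequestEntityTooLarge { // 413
		return errPayloadTooLarge
	}
	return nil
}

func main() {
	fmt.Println(errors.Is(classify(413, nil), errPayloadTooLarge))                            // true: split path
	fmt.Println(errors.Is(classify(0, errors.New("write: broken pipe")), errPayloadTooLarge)) // false: retry/drop path
}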

  1. How do I tell how large the payload is that libbeat is trying to send to Elasticsearch via a POST _bulk API call?

Ideally, by getting a 413 response back.
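
If you want to estimate it yourself, one rough approach (a sketch, not anything built into libbeat; the index name and documents below are made up) is to serialize a sample of events the way a _bulk body is framed, one action line plus one source line per event, newline-delimited, and sum the bytes:

package main

import (
	"encoding/json"
	"fmt"
)

// estimateBulkSize approximates the size in bytes of a _bulk request body for
// the given documents: one action line and one source line per document, each
// terminated by a newline, as the NDJSON framing of _bulk requires.
func estimateBulkSize(index string, docs []map[string]any) (int, error) {
	action, err := json.Marshal(map[string]any{"create": map[string]string{"_index": index}})
	if err != nil {
		return 0, err
	}
	total := 0
	for _, doc := range docs {
		src, err := json.Marshal(doc)
		if err != nil {
			return 0, err
		}
		total += len(action) + 1 + len(src) + 1 // +1 for each trailing newline
	}
	return total, nil
}

func main() {
	docs := []map[string]any{
		{"message": "hello", "host": "web-01"},
		{"message": "world", "host": "web-02"},
	}
	size, err := estimateBulkSize("logs-generic-default", docs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("approximate _bulk body size: %d bytes\n", size)
}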

  1. Does libbeat ignore the setting of http.max_content_length on the server?

Yes, because it doesn't know about it. Libbeat does not query this parameter, and Elasticsearch does not give it back.

Are you sure that Elasticsearch does not give this parameter back?

If I do:

GET _cluster/settings?flat_settings=true&include_defaults=true (Cluster get settings API | Elasticsearch Guide [8.9] | Elastic)

I see a bunch of settings, including:

 "http.max_content_length": "100mb",

If this parameter is queryable from Elasticsearch, would it be possible to
change elastic-agent / beats to query this parameter, and adjust the size of the
payload being sent by POST _bulk calls?
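
For what it's worth, here is a rough sketch (not existing Beats behavior) of what querying that setting could look like with a plain HTTP GET against the cluster settings API; authentication, TLS, and the Elasticsearch URL are omitted or placeholders:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// maxContentLength fetches the cluster settings (flat keys, defaults included)
// and returns the value of http.max_content_length, e.g. "100mb".
func maxContentLength(esURL string) (string, error) {
	resp, err := http.Get(esURL + "/_cluster/settings?flat_settings=true&include_defaults=true")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var settings map[string]map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&settings); err != nil {
		return "", err
	}
	// Explicit settings (transient, then persistent) take precedence over defaults.
	for _, group := range []string{"transient", "persistent", "defaults"} {
		if v, ok := settings[group]["http.max_content_length"].(string); ok {
			return v, nil
		}
	}
	return "", fmt.Errorf("http.max_content_length not found in cluster settings")
}

func main() {
	v, err := maxContentLength("http://localhost:9200")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("http.max_content_length =", v)
}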

Yes, sorry, I didn't mean it was impossible to query; I meant it isn't part of the _bulk response. I suppose we could query it, but the batch splitting implementation was supposed to make that unnecessary. If we don't always get 413 responses back when a batch is too large, it would make sense for us to do that.

No worries!
Should I submit an enhancement request for elastic-agent / beats to query http.max_content_length from Elasticsearch so that this can be used as part of the batch splitting implementation?
Either somewhere in GitHub, or via https://support.elastic.co?

I would do both: create a public GitHub issue in Beats and also raise it through support. Going through support in addition to creating an issue will help with prioritization.

I have submitted:

  1. Enhance logic for splitting large payload requests by querying http.max_content_length from Elasticsearch · Issue #36534 · elastic/beats · GitHub
  2. Enhancement Request: #19624
