`sanitize_field_names` confusion, and minimizing payload size

Elasticsearch version:
Kibana version:
APM Server version: 7.5.1

APM Agent language and version:
Java Agent 1.13.0

Fresh install or upgraded from other version?
Fresh Docker-compose

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
I expect `sanitize_field_names=*url*` in `elasticapm.properties` to sanitize `url.path` and the other `url` keys in the documents sent to the APM server. However, the sanitization does not appear to occur.

Here is what the document looks like once received:


Whereas I expect those values to become [REDACTED].
Am I not able to redact non-header information? That would be very useful to me.

Steps to reproduce:

  1. Set up APM Server and Java agent 1.13.0 in a standard Spring project.
  2. Add `sanitize_field_names=*url*` to `elasticapm.properties`.
  3. Send some requests to the endpoint and wait for APM data to arrive in Kibana.

Finally, is there any way to minimize payload size? I have already disabled every instrumentation I can think of and have set `capture_headers=false`, but the documents still contain lots of data I don't care about, such as the following:

Is there any way to remove those, either via config or the public API?

Cheers,

--Tadgh

Hello and thanks for the questions.

You seem to have misunderstood this configuration option; see the explanation and default values. You should specify the key names of the traced data you want sanitized (e.g. request/response headers), not Elastic schema field names.
Moreover, this option is intended for hiding sensitive data, so it is not applied to data that is not expected to contain sensitive information, such as URL paths.
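To illustrate the intended usage, a sketch of `elasticapm.properties` (the patterns below are examples, not the agent's actual defaults) would match header and form field *names*, not Elasticsearch document fields like `url.path`:

```properties
# elasticapm.properties -- hypothetical example.
# Patterns match names of request/response headers and form fields;
# matched values are replaced with [REDACTED] by the agent.
sanitize_field_names=password,passwd,*auth*,x-api-key,*session*
```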

What data do you care about? What are you trying to minimize: the amount of data sent from the agent to the APM server, or the data stored in Elasticsearch?

Specifically, the amount of data stored in Elasticsearch. Almost every default field in the common schema stored here is irrelevant to me, and I'd rather save the space if possible.

I have a use case in which it is recommended to regard logs as being as sensitive as the records themselves, and that includes the URL path. From the FHIR HTTP docs:

> Note: Supporting GET means that PHI (Personal health information) might appear in search parameters, and therefore in HTTP logs. For this reason logs should be regarded as being as sensitive as the resources themselves. This is a general requirement irrespective of the use of GET - see the security page for further commentary.

Which is why I am trying to do this in the first place. I'm very aware that data in the URL should not be sensitive, but I do have a use case for this, edge case as it may be :smiley:

In that case, you may want to explore ingest node pipelines (specifically, the remove processor may be handy) to pre-process documents before they get stored.
In addition (or instead), you can consider ILM settings to manage indices according to whatever policy fits your needs.
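As a sketch of that approach (the pipeline name and field list are examples; adjust them to your mapping), a pipeline using the `remove` processor could strip URL fields before indexing:

```json
PUT _ingest/pipeline/drop_url_fields
{
  "description": "Remove URL fields from APM documents before indexing",
  "processors": [
    {
      "remove": {
        "field": ["url.path", "url.full", "url.original", "url.query"],
        "ignore_missing": true
      }
    }
  ]
}
```

The pipeline then needs to be applied to the relevant APM indices (for example via the index settings' default pipeline).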

Sure, that wasn't an attempt to educate you, but to explain why we do not apply it to paths. That's a good point, and we may consider applying sanitization to URL paths as well. For now, you may be able to use the ingest pipeline settings referenced above, with the processors of your choice, to remove or redact the relevant URL fields.
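If you prefer redaction over removal (so the fields remain present but carry no PHI), a `set` processor can overwrite them with a placeholder; field names here are illustrative:

```json
PUT _ingest/pipeline/redact_url_fields
{
  "description": "Replace URL fields with a redaction placeholder",
  "processors": [
    { "set": { "field": "url.path", "value": "[REDACTED]" } },
    { "set": { "field": "url.full", "value": "[REDACTED]" } }
  ]
}
```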

Yeah, I was hoping there was some agent config that would prevent the data from being sent at the source, as I was trying to avoid writing an ingest pipeline, but no big deal. Pipeline written, task done. Thanks for your input!

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.