APM transaction data not compressed properly


Kibana version: 7.5.2

Elasticsearch version: 7.5.2

APM Server version: 7.6.0

APM Agent language and version: Java 1.12.0

Our ELK stack and APM Servers are deployed with the Elastic operator and Helm charts on Kubernetes.
Is there anything special in your setup? We use an AWS load balancer in front of the APM Servers.

We have six ES data instances, each with 15 vCPUs, 30 GiB of RAM, and a 15 GiB heap.

We are using EBS volumes, each with 800 GB of storage and 3,000 provisioned IOPS.

Description of the problem, including expected versus actual behavior:

Our APM transaction data doesn't seem to compress very well. According to the sizing guide (https://www.elastic.co/guide/en/apm/server/current/sizing-guide.html):

> Indexing 100 unsampled transactions per second for 1 hour results in 360,000 documents. These documents use around 50 MB of disk space.

We index around 35,000 transactions per second, which means we send about 126 million documents every hour. At this moment we have 475 million documents in the index, which by the guide's ratio should come to around 66 GB if compressed that well, yet the primary index is at 180 GB. This is not scalable for us.
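For reference, here is the arithmetic behind that estimate as a quick sketch; it assumes the sizing guide's 360,000-documents-to-50-MB ratio scales linearly to our volume:

```python
# Sketch of the extrapolation above; assumes the sizing guide's
# ratio (360,000 documents ~= 50 MB) scales linearly.
GUIDE_DOCS = 360_000
GUIDE_MB = 50

tx_per_second = 35_000
docs_per_hour = tx_per_second * 3600   # 126,000,000 documents/hour
docs_in_index = 475_000_000

expected_gb = docs_in_index / GUIDE_DOCS * GUIDE_MB / 1000
print(f"{docs_per_hour:,} docs/hour, expected ~{expected_gb:.0f} GB on disk")
```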


Please let us know what we can do to cut down on disk usage.

Thank you!

We use different indices for the different types of APM data, and our transaction sample rate is very low: 0.000005.
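For context, this is roughly how we pass that rate to the agent; a minimal sketch assuming the Java agent's standard `transaction_sample_rate` option is supplied as a JVM system property (jar path and service name below are placeholders):

```shell
# Sketch: setting the APM Java agent sample rate via system properties
# (agent jar path and service name are placeholders).
java -javaagent:/path/to/elastic-apm-agent.jar \
     -Delastic.apm.service_name=my-service \
     -Delastic.apm.transaction_sample_rate=0.000005 \
     -jar my-app.jar
```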

Hi and thanks for the question!

I completely agree that storing all unsampled transactions does not scale, and we are fully aware of that. Until recently, technical limitations prevented us from dropping them while still being able to provide accurate histograms.
The introduction of the histogram datatype gave us the storage solution we needed. While some pieces are still missing on the query side, we have already started working on the architectural change that will allow the storage savings you are looking for by dropping unsampled transactions (this will probably be opt-in, at least until the next major release). I believe this addresses your concern exactly.

I assume you simply extrapolated from this part of the guide:

> Indexing 100 unsampled transactions per second for 1 hour results in 360,000 documents. These documents use around 50 MB of disk space.

However, you should take all the other notes on that page into account as well. For the compression part specifically, the note at the bottom is very relevant:

> These examples were indexing the same data over and over with minimal variation. Because of that, the observed compression ratios of 80-90% are somewhat optimistic.

You would not get the same compression ratio with real-world data.
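To put numbers on that, a quick sketch using the figures from this thread: the observed index stores roughly 2.7x more bytes per document than a linear extrapolation from the guide would suggest.

```python
# Sketch: observed vs. extrapolated storage, using the numbers from
# this thread (475M documents, 180 GB observed, ~66 GB extrapolated).
docs = 475_000_000
observed_gb = 180
extrapolated_gb = 66

observed_bytes_per_doc = observed_gb * 1e9 / docs      # ~379 bytes/doc
expected_bytes_per_doc = extrapolated_gb * 1e9 / docs  # ~139 bytes/doc
overhead = observed_gb / extrapolated_gb               # ~2.7x
print(f"{observed_bytes_per_doc:.0f} vs {expected_bytes_per_doc:.0f} "
      f"bytes/doc ({overhead:.1f}x the optimistic estimate)")
```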

I hope this helps.


Thank you, I guess I'll wait for those features to be released.