Update to apm-server 7.9 increased my APM doc count (a lot!)

Hi! I've been using APM with Elastic for some months now. Everything was fine so far, but about 2 weeks ago I updated from 7.5 to 7.9, and since then the number of documents in my apm* indices has grown a lot (about 10x).

I've been trying to find the source of this, since there seems to be no noticeable change in the number of transactions per unit of time.
To be clear, the increase is in the span indices.

Elastic Stack version is 7.9 (Elasticsearch, Kibana and apm-server).

I'm using the Python APM agent 5.8.

Any ideas why, with the same "input", I'm getting that many more docs?

Thanks
Rodrigo.

If the rate of transaction docs has remained stable, but the rate of span docs has increased, that could be due to a change in sampling rate.

You could check this by comparing the number of transaction docs with transaction.sampled: true vs transaction.sampled: false, using Discover.
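For reference, the same breakdown can be expressed as a terms aggregation against the APM indices. This is just a sketch: the index pattern in the comments and the local cluster URL are assumptions about your setup.

```python
# Sketch: the same check you'd do in Discover, expressed as an Elasticsearch
# terms aggregation over transaction.sampled. Run it against an index pattern
# like "apm-*-transaction-*" (an assumption; adjust to your setup).
sampled_breakdown = {
    "size": 0,  # we only want the aggregation buckets, not the hits
    "query": {"term": {"processor.event": "transaction"}},
    "aggs": {
        "by_sampled": {
            "terms": {"field": "transaction.sampled"}
        }
    },
}

# With the official Python client this would look something like:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   resp = es.search(index="apm-*-transaction-*", body=sampled_breakdown)
#   for bucket in resp["aggregations"]["by_sampled"]["buckets"]:
#       print(bucket["key"], bucket["doc_count"])
```

If the sampling rate changed, you'd see the `false` bucket appear (or grow) in the newer indices.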

If it is the case that the sampling rate has changed: as far as I know just upgrading the server from 7.5 to 7.9 wouldn't cause that. Did you (or someone else) also change agent sampling rate configuration around the same time?

Hi! Thanks for your reply!
I just checked and no: my apm-7.5.1-transaction-000004 index has all docs with sampled set to true, and apm-7.9.0-transaction-000001 shows the same sampled value. I also checked the agent's config and it's set to transaction_sample_rate: 1.
One thing I had to "fix" was the ingest pipeline for APM: https://www.elastic.co/guide/en/apm/server/current/upgrading-to-77.html.
I did that yesterday but I see no effect on my docs yet (I understand this should impact doc size, not doc count, but I thought it was worth mentioning).

Thanks!

Have you also recently updated the python agent or only the APM Server?

No, I did not; we're running our Django code with the elastic-apm==5.3.2 package.

@Rodrigo_Jimenez the only thing that we (Silvia and I) have thought of as a possible reason is a change in the server's "self-instrumentation" (the apm-server.instrumentation.* config). I don't suppose you have enabled that? If that is the case, the additional docs should all have service.name: "apm-server".

Otherwise, I'm really not sure what could have caused a spike in docs. If you can find anything common to the additional docs, then that might help us narrow it down.

Ok, so, first of all, thanks for taking the time (you and Silvia) to help out with this,

Regarding apm-server config, no, the only special config is:

```yaml
setup.template.enabled: true
setup.template.overwrite: true
setup.template.append_fields:
  - name: http.request.headers.Origin
    type: keyword
```

The idea was to be able to aggregate by this field (it shows as indexable now in Kibana, but filtering does not work; it always shows no documents :frowning: ). It's important for me, since our React frontend sends the Origin header, and that way I'd be able to see which frontend instances are hitting my API.
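For context, this is the kind of aggregation I'm trying to run. Just a sketch: the field name comes from the template override above, and I'm assuming the field was actually indexed as keyword.

```python
# Sketch: count transaction docs per Origin header value, so I can see
# which frontend instances are hitting the API. The field name matches the
# setup.template.append_fields override; bucket size is arbitrary.
origin_breakdown = {
    "size": 0,
    "query": {"term": {"processor.event": "transaction"}},
    "aggs": {
        "by_origin": {
            "terms": {"field": "http.request.headers.Origin", "size": 20}
        }
    },
}
```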

Back to the issue of increased volume: trying to find some field or new type of doc common to the extra volume was the first thing I tried. I started with the top 5 menu in Discover, but since that only looks at the last 500 docs, I created some ad-hoc visualizations aggregated by terms (span type, subtype, host.name). All I could find is that only one of the agents shows a spike.

You can see that "api-prod" increased in volume around the 25th (when I updated Elasticsearch, Kibana and apm-server, all from 7.5 to 7.9).

I don't know if this helps but:


This is the time histogram with span.type and span.subtype buckets. DB spans go crazy :frowning:

Also, this is the agent config for that app:

Thanks!

No smoking gun I'm afraid.

It's interesting that transaction_max_spans has a non-default value. That makes me wonder if there's a problem with agent central config, preventing the agent from fetching the config and falling back to the default of 500.
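In case it helps, one way to rule central config out is to pin the limit locally in the Django settings, so the agent doesn't depend on fetching it remotely. A minimal sketch, where the service name and server URL are placeholders for your own values:

```python
# settings.py -- sketch of pinning agent config locally in the Django
# integration. SERVICE_NAME and SERVER_URL below are placeholders.
ELASTIC_APM = {
    "SERVICE_NAME": "api-prod",            # assumption: your service name
    "SERVER_URL": "http://localhost:8200", # assumption: your APM Server URL
    # The agent drops spans beyond this limit per transaction; when no value
    # is resolved it falls back to the default of 500.
    "TRANSACTION_MAX_SPANS": 100,
    "TRANSACTION_SAMPLE_RATE": 1.0,
}
```

With this in place, a change in span volume can no longer be explained by central config failing to apply.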

Does the "Agent remote configuration" UI indicate that the config has been applied?

image

image
It does :frowning:

Unless there is some coincidental change (which I suppose you have ruled out), the only other thing I can think of is that the agent has been dropping events because the server couldn't keep up.

We did implement some performance optimisations between 7.5 and 7.9, so it's conceivable that the server is able to keep up with a higher load now. This seems unlikely to explain your situation though, as it should manifest in transactions being dropped as well as spans.

Anyway, you could try comparing transaction.span_count.dropped between 7.5 and 7.9 to see if there's a change in number of dropped spans per transaction.
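A sketch of that comparison as a query body, assuming the default @timestamp field and a daily interval:

```python
# Sketch: average dropped spans per transaction over time, to compare the
# period before and after the upgrade. Interval and field names follow the
# default APM mapping; adjust the time range/interval as needed.
dropped_over_time = {
    "size": 0,
    "query": {"term": {"processor.event": "transaction"}},
    "aggs": {
        "per_day": {
            "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "1d",
            },
            "aggs": {
                "avg_dropped": {
                    "avg": {"field": "transaction.span_count.dropped"}
                }
            },
        }
    },
}
```

A step change in `avg_dropped` around the upgrade date would point at the span limit (or its enforcement) having changed.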

You may be getting closer with that.
About 10 days ago I changed the agent's max span setting for "api-prod" to 100 (it was taking too much space and messing up my retention).

But I checked what you suggested. So first, this is the volume of transactions during the last 30 days:


Constant all the way..
This is Spans, same period:

Now check this out: this shows the average number of dropped spans during those 30 days.

Finally, this is the number of transactions which lost some spans.


This one shows that, some spans were dropped after I changed the max span setting 10 days ago.

So, with all this, several things might have been happening at the same time.
First, I'm not sure the 500 max spans limit was actually in effect while on Stack 7.5.
Second, after upgrading to 7.9 the number of spans increased but dropped spans decreased, so... performance upgrade???
Third, the 100 max span setting is being honoured: I'm dropping some spans, but the total is still nowhere near what it was on Stack 7.5.

What do you think?

Thanks!!!

Sorry, I think I confused things earlier by mentioning server performance and dropped spans together; there should be no relationship between server performance and span_count.dropped. That count is incremented whenever the agent decides to drop a span because the configured max spans limit has been reached.

If the server is unable to process events quickly enough, its queue fills up and it stops processing new events from the agent, returning a "503 queue full" error. When this occurs, some events may be lost, but span_count.dropped will not capture this. If performance increased sufficiently in 7.9, this may be happening less often, with fewer events being lost. However, looking again at the differences you're seeing, I would be surprised if increased performance alone were enough to account for them.

Do you have Stack Monitoring enabled? If so, have a look at the "Response Errors Intake API" graph and see if there was a significant change around the time you upgraded.

I do have that enabled, but it seems there is only 7-day retention for it, so no clue about errors intake :frowning: