Kibana version: 8.16.0
Elasticsearch version: 8.16.0
APM Server version: 8.16.0
APM Agent language and version: Java, 1.52.0
Original install method (e.g. download page, yum, deb, from source, etc.) and version: ECK 2.15.0
Fresh install or upgraded from other version?
New cluster with data restored from a snapshot taken from old cluster (v 7.17.20)
Is there anything special in your setup?
We use the standalone APM-server
Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
After migrating to a new 8.x cluster, the APM throughput graphs for our high-traffic apps that use Java agents show about a 90% reduction in transactions. My first guess was that the new version of APM Server drops unsampled transactions, but the throughput doesn't change when we change the sample rate.
When comparing the APM monitoring metrics from the old and new clusters, I can see that "Request/Response Count Intake Api" shows the same number of requests as before, but Output Events Rate is significantly lower.
So it seems that the APM-server takes in transactions as before, but then silently drops a significant portion of them without reporting any reason for the behaviour.
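To put a number on that gap, here is a minimal sketch that compares how many events are published per intake request on each cluster. The function names and the rates are placeholders made up for illustration, not values read from the monitoring graphs, and the sketch assumes both rates are taken over the same time window.

```python
def events_per_request(output_events_rate: float, intake_request_rate: float) -> float:
    """Average number of events published per intake request."""
    if intake_request_rate <= 0:
        raise ValueError("intake_request_rate must be positive")
    return output_events_rate / intake_request_rate


def relative_drop(old_value: float, new_value: float) -> float:
    """Fraction by which new_value is lower than old_value (0.9 means 90% lower)."""
    if old_value <= 0:
        raise ValueError("old_value must be positive")
    return (old_value - new_value) / old_value


# Placeholder numbers, NOT measurements from either cluster.
old = events_per_request(output_events_rate=5000.0, intake_request_rate=100.0)
new = events_per_request(output_events_rate=500.0, intake_request_rate=100.0)
drop = relative_drop(old, new)
print(f"events per request: old={old:.1f}, new={new:.1f}, drop={drop:.0%}")
```

If the intake request rate is the same on both clusters, a large drop in this ratio would point at events being lost between intake and output rather than at the agents sending less.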
I've tried different Java agent versions and settings, but to no avail. Other agents report the same number of transactions as before the migration. Some Java agents also work fine, even though their settings are identical to those of the affected ones.
I also rolled over the new APM-indices, which didn't fix the issue.
The agents don't report any errors and the only error from APM-server is:
{"log.level":"warn","@timestamp":"2024-11-19T17:05:06.919Z","log.logger":"agentcfg","log.origin":{"function":"github.com/elastic/apm-server/internal/agentcfg.(*ElasticsearchFetcher).Run.func1","file.name":"agentcfg/elasticsearch.go","file.line":150},"message":"refresh cache error: json: cannot unmarshal number into Go struct field .hits.hits._source.settings of type string","service.name":"apm-server","ecs.version":"1.6.0"}
I understand this is related to agent remote configuration, which seems to work despite this error: when I update a config, I can see that the agents receive their new configuration values.
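The warning says a JSON number was found where a string was expected under `hits.hits._source.settings`. One possible reading, which is an assumption and not something the log confirms, is that an agent configuration document restored from the 7.17 snapshot stores a setting value as a number instead of a string. A minimal sketch for spotting such documents in a search response follows; the helper name `find_non_string_settings` and the sample response are made up for illustration, and the real response would have to come from a search against whichever index holds the agent configurations.

```python
import json
from typing import Any, Dict, List, Tuple


def find_non_string_settings(hits: List[Dict[str, Any]]) -> List[Tuple[str, str, Any]]:
    """Return (document id, setting name, value) for every non-string setting."""
    problems = []
    for hit in hits:
        settings = hit.get("_source", {}).get("settings", {})
        for name, value in settings.items():
            if not isinstance(value, str):
                problems.append((hit.get("_id", "?"), name, value))
    return problems


# Made-up search response for illustration only.
sample_response = json.loads("""
{"hits": {"hits": [
  {"_id": "a", "_source": {"service": {"name": "svc-1"},
                           "settings": {"transaction_sample_rate": "0.1"}}},
  {"_id": "b", "_source": {"service": {"name": "svc-2"},
                           "settings": {"transaction_sample_rate": 0.1}}}
]}}
""")

for doc_id, name, value in find_non_string_settings(sample_response["hits"]["hits"]):
    print(f"doc {doc_id}: {name} = {value!r} ({type(value).__name__})")
```

Any document this flags would be a candidate for re-saving with string values. Whether that has anything to do with the dropped transactions is a separate question.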
Here are screenshots of the APM monitoring graphs from the old and new clusters, and an example of the dropped throughput from the APM service list in Kibana.
If someone has any idea what could cause this behaviour, I'd be very grateful for your help. I've been trying to solve this for a week already and am starting to run out of ideas.