Huge portion of transactions missing from Java agents after upgrade from 7 to 8

Kibana version: 8.16.0

Elasticsearch version: 8.16.0

APM Server version: 8.16.0

APM Agent language and version: Java, 1.52.0

Original install method (e.g. download page, yum, deb, from source, etc.) and version: ECK 2.15.0

Fresh install or upgraded from other version?
New cluster with data restored from a snapshot taken from the old cluster (v7.17.20)

Is there anything special in your setup?
We use the standalone APM Server

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

After migrating to the new 8.x cluster, the APM throughput graphs for our high-traffic apps using Java agents show about a 90% reduction in transactions. My first guess was that the APM Server was dropping unsampled transactions in the new version, but the throughput doesn't change when we change the sample rate.

When comparing the APM monitoring metrics from the old and new clusters, I can see that the "Request/Response Count Intake Api" graph shows the same number of requests as before, but the Output Events Rate is significantly lower.

So it seems that the APM Server ingests transactions as before, but then silently drops a significant portion of them without reporting any reason for the behaviour.

I've tried different Java agent versions and settings, to no avail. Other agents report the same number of transactions as before the migration. And some Java agents seem to work fine too, with no difference in their settings compared to the non-working ones.

I also rolled over the new APM indices, which didn't fix the issue.
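
For anyone wanting to try the same, an APM data stream can be rolled over with a request like the following (a sketch assuming the default data stream names, e.g. traces-apm-default; adjust for your own namespace):

POST /traces-apm-default/_rollover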

The agents don't report any errors, and the only error from the APM Server is:

{"log.level":"warn","@timestamp":"2024-11-19T17:05:06.919Z","log.logger":"agentcfg","log.origin":{"function":"github.com/elastic/apm-server/internal/agentcfg.(*ElasticsearchFetcher).Run.func1","file.name":"agentcfg/elasticsearch.go","file.line":150},"message":"refresh cache error: json: cannot unmarshal number into Go struct field .hits.hits._source.settings of type string","service.name":"apm-server","ecs.version":"1.6.0"}

I understand this is related to agent remote configuration, which seems to work despite the error: when I update a config, I can see that the agents receive their new configuration values.

Here are screenshots of the APM monitoring graphs from the old and new clusters, and an example of the dropped throughput from the APM Services list in Kibana.



If anyone has an idea what could cause this behaviour, I'd be very grateful for your help. I've been trying to solve this for a week already and I'm starting to run out of ideas.

Hello @JamiK, the refresh cache error log from APM Server that you shared could result in an unexpected transaction sample rate misconfiguration for the agents. Could you please share the output of the following two requests, so we can confirm that the agent configuration is valid?

POST /.apm-agent-configuration/_search
{
    "_source": ["settings", "service"]
}

GET /.apm-agent-configuration/_mapping

Additionally, you can check and edit the APM agent configuration by following this guide.
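
For example, a single configuration can be re-saved through the Kibana agent configuration API. This is only a sketch: the host, credentials, service name, and value below are placeholders, and note that all settings values must be sent as strings:

curl -X PUT "https://<kibana-host>/api/apm/settings/agent-configuration" \
  -u <user>:<password> \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{
        "service": { "name": "my-service", "environment": "production" },
        "settings": { "transaction_sample_rate": "0.1" }
      }'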

Thanks for the reply!

Curiously, it seems that some of our configs have transaction_sample_rate as a string, but some as a float:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 16,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".apm-agent-configuration",
        "_id": "DMtkL5MBfpZW-3_BkYSL",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "api_request_time": "20s",
            "server_timeout": "30s",
            "transaction_max_spans": "500",
            "transaction_sample_rate": "0.7",
            "api_request_size": "768kb"
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "UcTeVW4BevojlyjWDSTw",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
            "environment": "production"
          },
          "settings": {
            "transaction_sample_rate": "0.02",
            "capture_body": "all",
            "transaction_max_spans": "500",
            "log_level": "info"
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "Ptn3VW4B40DewEjs4sNA",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "N8z3VW4BXBQSJkofesfW",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "UoL4VW4BO3MRhZu9X3Pu",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "i5L0gnABG08WBUGQPPD-",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "42zygnAB52R7WLXwBT6Z",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.05,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "YM4jYG4BLD7MAvTp1-Cw",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.07,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "wCDygnAB7DLZpWpE3Ro0",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.07,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "iO5vnHEBYgD417la_j6h",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      }
    ]
  }
}

Here's the mapping for the index:

{
  ".apm-agent-configuration": {
    "mappings": {
      "dynamic": "strict",
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent_name": {
          "type": "keyword",
          "ignore_above": 1024
        },
        "applied_by_agent": {
          "type": "boolean"
        },
        "etag": {
          "type": "keyword",
          "ignore_above": 1024
        },
        "service": {
          "properties": {
            "environment": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "settings": {
          "dynamic": "true",
          "properties": {
            "api_request_size": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "api_request_time": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "capture_body": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "log_level": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "server_timeout": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "transaction_max_spans": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "transaction_sample_rate": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        }
      }
    }
  }
}

The .apm-agent-configuration index was actually broken after the migration, with documents failing with an error about the transaction_sample_rate field being malformed. I then re-created the index via these steps:

  1. Cloned the index into .apm-agent-configuration-backup
  2. Deleted the .apm-agent-configuration index
  3. Restarted Kibana, which re-created the index
  4. Reindexed the existing configs from .apm-agent-configuration-backup (roughly the request sketched below)
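
For reference, step 4 can be done with a reindex request roughly like this (index names as in the steps above):

POST _reindex
{
  "source": { "index": ".apm-agent-configuration-backup" },
  "dest": { "index": ".apm-agent-configuration" }
}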

This fixed the malformed-field error, and the agents started receiving their configs successfully again, but it didn't fix the original problem.

I found some broken agent configs from over 5 years ago, which I re-created. This fixed the remote configuration cache refresh problem!
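
In case it helps others hitting the same cache refresh error: the old configuration documents can be listed oldest-first with a query like the following (based on the mapping above), and then re-saved from the Agent Configuration page in Kibana:

GET /.apm-agent-configuration/_search
{
  "sort": [{ "@timestamp": "asc" }],
  "_source": ["service", "settings", "@timestamp"]
}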

I also found the root cause of the missing transactions: the Java agent's trace_continuation_strategy config option. The default value is continue, and that seems to have caused the caller service's transaction_sample_rate to dictate the sampling rate on the receiving end, at least for the throughput graphs. By setting the value to restart on the receiving end, the throughput graphs now include the external requests correctly!
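
For reference, a minimal sketch of that setting on a receiving service (the same option can also be passed as the JVM system property -Delastic.apm.trace_continuation_strategy=restart):

# elasticapm.properties on the receiving service
trace_continuation_strategy=restart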

Is this behaviour expected? It seems like a big change from how it worked in version 7.

Based on the distributed tracing documentation, I understand that the caller service's sample rate decides the downstream sample rate, and that this is expected behaviour.

But it seems that this leads to two options:

  • We don't get accurate throughput graphs, but distributed tracing works and we can see the full cross-service trace in the APM UI
  • We get accurate throughput graphs, but distributed tracing visibility is reduced, as the downstream traces are no longer visible in the caller's trace. There is a span link, but from a UX perspective this is much worse than before

Also, clicking the span link button produces an error: