Huge portion of transactions missing from Java agents after upgrade from 7 to 8

Kibana version: 8.16.0

Elasticsearch version: 8.16.0

APM Server version: 8.16.0

APM Agent language and version: Java, 1.52.0

Original install method (e.g. download page, yum, deb, from source, etc.) and version: ECK 2.15.0

Fresh install or upgraded from other version?
New cluster with data restored from a snapshot taken from old cluster (v 7.17.20)

Is there anything special in your setup?
We use the standalone APM-server

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

After migrating to a new 8.x cluster, the APM throughput graphs for our high-traffic apps using Java agents show about a 90% reduction in transactions. My first guess was that this was caused by the APM-server dropping unsampled transactions in the new version, but the throughput doesn't change when we change the sample rate.

When comparing the APM monitoring metrics from the old and new clusters, I can see that "Request/Response Count Intake Api" shows the same number of requests as before, but the Output Events Rate is significantly lower.

So it seems that the APM-server takes in transactions as before, but then silently drops a significant portion of them without reporting any reason for the behaviour.

I've tried different Java agent versions and settings, but to no avail. Other agents report the same number of transactions as before the migration, and some Java agents seem to work too, with no difference in their settings compared to the non-working ones.

I also rolled over the new APM-indices, which didn't fix the issue.
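
For clarity, by rolling over I mean a manual rollover of the APM data streams, along the lines of the following (the data stream name is the 8.x default and may differ in other setups):

POST /traces-apm-default/_rollover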

The agents don't report any errors and the only error from APM-server is:

{"log.level":"warn","@timestamp":"2024-11-19T17:05:06.919Z","log.logger":"agentcfg","log.origin":{"function":"github.com/elastic/apm-server/internal/agentcfg.(*ElasticsearchFetcher).Run.func1","file.name":"agentcfg/elasticsearch.go","file.line":150},"message":"refresh cache error: json: cannot unmarshal number into Go struct field .hits.hits._source.settings of type string","service.name":"apm-server","ecs.version":"1.6.0"}

I understand this is related to agent remote configuration, which seems to work despite this error: when I update a config, I can see that the agents receive the new configuration values.

Here are screenshots of the APM monitoring graphs from the old and new clusters, and an example of the dropped throughput from the APM service list in Kibana.



If someone has any idea what could cause this behaviour, I'd be very grateful for your help. I've been trying to solve this for a week already and I'm starting to run out of ideas.

Hello @JamiK, the refresh cache error log in APM Server that you shared could result in an unexpected transaction sample rate misconfiguration for the agents. Could you please share the output of the following two requests to confirm that the agent configuration is valid?

POST /.apm-agent-configuration/_search
{
    "_source": ["settings", "service"]
}

GET /.apm-agent-configuration/_mapping

Additionally, you can check and edit the APM agent configuration by following this guide.
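
For context, the unmarshal error you shared suggests that APM Server reads the settings object as a map of string keys to string values, so every value, including transaction_sample_rate, is expected to be stored as a JSON string. A well-formed document would look roughly like this (the service name is a placeholder):

{
  "service": { "name": "my-service" },
  "settings": {
    "transaction_sample_rate": "0.1",
    "transaction_max_spans": "500"
  }
}

A numeric value such as "transaction_sample_rate": 0.1 would produce exactly that "cannot unmarshal number ... of type string" error.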

Thanks for the reply!

Curiously, it seems that some of our configs have transaction_sample_rate as a string, but some as a float:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 16,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".apm-agent-configuration",
        "_id": "DMtkL5MBfpZW-3_BkYSL",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "api_request_time": "20s",
            "server_timeout": "30s",
            "transaction_max_spans": "500",
            "transaction_sample_rate": "0.7",
            "api_request_size": "768kb"
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "UcTeVW4BevojlyjWDSTw",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
            "environment": "production"
          },
          "settings": {
            "transaction_sample_rate": "0.02",
            "capture_body": "all",
            "transaction_max_spans": "500",
            "log_level": "info"
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "Ptn3VW4B40DewEjs4sNA",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "N8z3VW4BXBQSJkofesfW",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "UoL4VW4BO3MRhZu9X3Pu",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "i5L0gnABG08WBUGQPPD-",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "42zygnAB52R7WLXwBT6Z",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.05,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "YM4jYG4BLD7MAvTp1-Cw",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.07,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "wCDygnAB7DLZpWpE3Ro0",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.07,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      },
      {
        "_index": ".apm-agent-configuration",
        "_id": "iO5vnHEBYgD417la_j6h",
        "_score": 1,
        "_source": {
          "service": {
            "name": "<redacted>"
          },
          "settings": {
            "transaction_sample_rate": 0.1,
            "capture_body": "off",
            "transaction_max_spans": 500
          }
        }
      }
    ]
  }
}

Here's the mapping for the index:

{
  ".apm-agent-configuration": {
    "mappings": {
      "dynamic": "strict",
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 1024,
              "type": "keyword"
            }
          }
        }
      ],
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent_name": {
          "type": "keyword",
          "ignore_above": 1024
        },
        "applied_by_agent": {
          "type": "boolean"
        },
        "etag": {
          "type": "keyword",
          "ignore_above": 1024
        },
        "service": {
          "properties": {
            "environment": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "name": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },
        "settings": {
          "dynamic": "true",
          "properties": {
            "api_request_size": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "api_request_time": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "capture_body": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "log_level": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "server_timeout": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "transaction_max_spans": {
              "type": "keyword",
              "ignore_above": 1024
            },
            "transaction_sample_rate": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        }
      }
    }
  }
}

The .apm-agent-configuration index was actually broken after the migration, with documents producing errors about the transaction_sample_rate field being malformed. I then re-created it via these steps:

  1. Cloned the index into .apm-agent-configuration-backup
  2. Deleted the .apm-agent-configuration
  3. Restarted Kibana, which re-created the index
  4. Reindexed the existing configs from .apm-agent-configuration-backup

This fixed the malformed-field error and the agents started to receive the configs successfully again, but it didn't fix the original problem.
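
For reference, steps 1, 2 and 4 correspond roughly to the following requests (the clone API requires the source index to be write-blocked first, so treat this as a sketch rather than an exact transcript of what I ran):

# Step 1: block writes, then clone the index
PUT /.apm-agent-configuration/_settings
{
  "index.blocks.write": true
}
POST /.apm-agent-configuration/_clone/.apm-agent-configuration-backup

# Step 2: delete the original index (step 3, restarting Kibana, re-creates it)
DELETE /.apm-agent-configuration

# Step 4: copy the configs back from the backup
POST /_reindex
{
  "source": { "index": ".apm-agent-configuration-backup" },
  "dest": { "index": ".apm-agent-configuration" }
}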

I found some broken agent-configs from over 5 years ago, which I re-created. This fixed the remote configuration cache refresh problem!

I also found the root cause of the missing transactions. It was caused by the Java agent's trace_continuation_strategy config parameter. The default value is continue, and that seems to have caused the caller service's transaction_sample_rate to dictate the sampling rate on the receiving end, at least for the throughput graphs. By setting the value to restart on the receiving end, the throughput graphs now include the external requests correctly!
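
In case it helps someone else, the option can be set in elasticapm.properties on the receiving service, or as a JVM system property; something like:

# elasticapm.properties on the receiving service
trace_continuation_strategy=restart

# or equivalently as a JVM flag
-Delastic.apm.trace_continuation_strategy=restart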

Is this behaviour expected? It seems like a big change compared to how it worked in version 7.

Based on the distributed tracing documentation, I understand that the caller service's sample rate decides the downstream sample rate, and that this is expected behaviour.

But it seems that this leads to two options:

  • We don't get accurate throughput graphs, but distributed tracing works and we can see the full cross-service trace in the APM UI.
  • We get accurate throughput graphs, but distributed tracing visibility is reduced, as the downstream traces are no longer visible in the caller's trace. There is a span link, but from a UX perspective this is a lot worse than before.

Also, clicking the span link button produces an error:

I think the confusion here comes from the change for unsampled transactions in version 8.0; it's documented here for the APM Java agent.

For pre-8.0 servers, the agent still records and sends the overall time and result for unsampled transactions, but no context information, labels, or spans. When connecting to 8.0+ servers, unsampled transactions are not sent at all.

Combined with the trace_continuation_strategy, this can have an effect on the graphs. To read more about this change, you could also refer to this specification document.
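
To make the sample rate propagation concrete: the upstream sampling decision and sample rate travel in the W3C trace context headers, roughly like this (the IDs are made up):

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: es=s:0.02

The trailing 01 in traceparent is the sampled flag, and the es entry in tracestate carries the upstream sample rate, so with the default continue strategy the downstream agent adopts both instead of applying its own transaction_sample_rate.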

This looks like a separate issue with how span links are loaded in the UI. Could you please share the (redacted) content of the document returned by the following request so we can validate that?

GET *traces-apm*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "trace.id": "270e86bcf7063519fa24e4bcf02af8c0"
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "span.id": "dad9f5c2dca42826"
                }
              },
              {
                "term": {
                  "transaction.id": "dad9f5c2dca42826"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  },
  "_source": [
    "span"
  ]
}

Thank you so much for linking that agent spec documentation! It was just the thing I needed. We were finally able to pinpoint the cause of the inaccurate throughput graphs: it was one of our own in-house wrappers for the agent. It mangled the tracecontext header, so the upstream services didn't propagate the sampling rate correctly to the downstream services.

Here's the problematic span regarding the span-linking issue:

{
  "took": 491,
  "timed_out": false,
  "_shards": {
    "total": 301,
    "successful": 301,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": ".ds-traces-apm-default-2024.11.19-000011",
        "_id": "s_R1SZMBsgdiDBTdc1BN",
        "_score": 0,
        "_source": {
          "span": {
            "duration": {
              "us": 10630
            },
            "representative_count": 1,
            "subtype": "http",
            "destination": {
              "service": {
                "resource": "<redacted>:443",
                "name": "https://<redacted>",
                "type": "external"
              }
            },
            "name": "GET",
            "id": "dad9f5c2dca42826",
            "type": "external"
          }
        }
      }
    ]
  }
}

I'm glad that the original issue got resolved for you.

With regard to the span-linking issue, I was able to reproduce it locally too. In a nutshell, the APM UI assumes that subtype is always present for linked spans, which is not always the case.

As a workaround until it's fixed, you need to set subtype on the linked spans. For example, here's a documentation page that explains how to start a span with the Java agent's public tracing API.
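
For illustration, here is a minimal sketch using the public API; the type/subtype/action values ("external", "http", "GET") and the surrounding class are just examples:

import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Span;
import co.elastic.apm.api.Transaction;

public class ManualExitSpanExample {

    // Creates a child span with an explicit subtype ("http"); the no-arg
    // startSpan() leaves span.subtype unset, which is what trips up the
    // span-links view in the APM UI.
    static void callDownstream(Runnable httpCall) {
        Transaction transaction = ElasticApm.currentTransaction();
        Span span = transaction.startSpan("external", "http", "GET");
        try {
            span.setName("GET downstream-service");
            httpCall.run(); // the actual outgoing request
        } catch (RuntimeException e) {
            span.captureException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}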

P.S.

Thank you again for reporting this to us. I raised the issue for Kibana here.
