APM UI Kibana Internal Server Error

Kibana version : 7.5.2

Elasticsearch version : 7.5.2

APM Server version : 7.6.0

APM Agent language and version : Java 1.12.0

We have 6 ES data instances, each with 7 vCPUs and 62 GiB of RAM, with 23 GiB of heap.

We are using EBS volumes, and they are also operating well below their maximum capacity.

The Kibana APM UI throws "Error: Request Timeout after 30000ms" for 1 day's worth of data, but the Discover section of Kibana can load 7 days' worth of data, which is around 1.6 billion events. Our cluster health is normal and it is not overutilized.

Even the Kibana pod is not overutilized.

"Internal Sever Error" in Kibana UI for APM:

The Discover section in Kibana can load 7 days' worth of documents in less than 30 seconds:

I tried the same query in the Search Profiler and got a response in less than 17 seconds.
All the data, master, and Kibana nodes are working normally.
I've also tried setting "client_max_body_size 5000m;" in our nginx reverse proxy, and it did not fix the problem.

  • Thank you

Hi there,

Thanks for your interest in Elastic APM and making us aware of this issue.

This doesn't sound right and I'd like to dig a little deeper into this. In the APM app in Kibana we have a debugging mechanism that lets you dump the raw Elasticsearch query. I'd like you to try this.

Open your browser's Developer Tools and find the XHR request that starts with /api/apm/services. You can enable debugging by adding _debug=true to the query params, like:
/api/apm/services?_debug=true

This will dump the ES query in the Kibana logs.

With the ES query in hand, you can execute it directly in Kibana Dev Tools. I would expect it to still time out.
You might find the Profile API useful here.
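For example, a minimal sketch of what that could look like in Dev Tools (the apm-* pattern and the filters below are just placeholders, paste the exact query from your logs instead):

GET apm-*/_search
{
  "size": 0,
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["transaction", "error", "metric"] } },
        { "range": { "@timestamp": { "gte": "now-1d", "lte": "now" } } }
      ]
    }
  }
}

The profile section of the response shows where each shard spends its time for every part of the query.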

Very interested to hear what you find out.

Hi @sqren,

Thank you for your reply.
The indices were deleted as a part of ILM.

I'm running more load tests to generate data to reproduce the error.
In the meantime, please let me know why changing transaction_sample_rate from 1 to 0.00002 doesn't seem to have much effect on disk space usage.

When the rate was set to 1 we ingested 1 GB of data every 25 seconds, and with the rate set to 0.00002 we are ingesting 1 GB every 1.5 minutes. I've also stopped ingesting headers and body. FYI: we are using the Elastic APM Java agent with around 20 services, and we get around 2 million requests every minute. Please let me know if I'm missing anything else important for saving disk space.
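For reference, one way to see where the disk space actually goes (assuming the indices match apm-*, which ours do) is to list index sizes and break the documents down by event type:

GET _cat/indices/apm-*?v&h=index,docs.count,store.size&s=store.size:desc

GET apm-*/_search
{
  "size": 0,
  "aggs": {
    "by_event": {
      "terms": { "field": "processor.event" }
    }
  }
}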

  • Thank you

Hi @sqren,

I generated more data and found the query in the Kibana logs for 3 days' worth of data; it looks like this:
--DEBUG ES QUERY--
GET /api/apm/services {"_debug":"true","start":"2020-03-12T18:03:29.441Z","end":"2020-03-15T18:03:29.441Z","uiFilters":"{}"}
GET apm-*,apm-*,apm-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "transaction",
              "error",
              "metric"
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1584036209441,
              "lte": 1584295409441,
              "format": "epoch_millis"
            }
          }
        },
        {
          "range": {
            "observer.version_major": {
              "gte": 7
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "services": {
      "terms": {
        "field": "service.name",
        "size": 500
      },
      "aggs": {
        "avg": {
          "avg": {
            "field": "transaction.duration.us"
          }
        },
        "agents": {
          "terms": {
            "field": "agent.name",
            "size": 1
          }
        },
        "events": {
          "terms": {
            "field": "processor.event",
            "size": 2
          }
        },
        "environments": {
          "terms": {
            "field": "service.environment"
          }
        }
      }
    }
  }
}
The weird thing is that the query is timing out in the Search Profiler but not in the APM UI.

This is because transaction_sample_rate only affects the apm-span-* index.
All transactions are stored in the apm-transaction index no matter what transaction_sample_rate is set to.
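A quick way to check this on a cluster (assuming the default APM field names) is to count transaction documents that were not sampled but were still stored:

GET apm-*/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "processor.event": "transaction" } },
        { "term": { "transaction.sampled": false } }
      ]
    }
  }
}

If lowering transaction_sample_rate actually dropped transaction documents, this count would stay close to zero.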

I think this is the real reason why the APM UI is that SLOW, and since it's an architectural problem, this slowness issue is hard to solve with tuning alone.

If this is true, one should be very careful about applying Elastic APM to a real production environment.

I also captured the slow query HERE and replied HERE as @sqren asked 20 days ago, and I am still waiting for his answer.

Yeah, I agree, cutting down the transaction sample rate can only reduce disk usage up to a certain limit.
I'm able to read Filebeat logs much faster than APM data, even though they take more disk space.

APM is timing out even though CPU usage is very low. When I search APM data for around 2 days it takes more than 30-40 seconds, and it doesn't use all the cores. The search queue is barely used.

I also noticed that CPU usage is very low and IOPS (I/O operations per second) is not that high.
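A quick way to confirm that from the cluster itself is the cat thread pool API, which shows active threads, queue depth, and rejections for the search pool on each node:

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed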

Yeah, I see that too: CPU is very low for searches and response times are quite slow in the APM UI. I did see some improvement when I doubled the number of vCPUs (and also cut the RAM by 50%, basically switching from r5.2xlarge to c5.4xlarge). But I still feel that APM queries don't use the full capacity of our cluster.

We have 6 provisioned-IOPS volumes with 3000 IOPS, but the peaks I'm seeing are less than 500 IOPS.
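For the disk side, the node filesystem stats expose cumulative I/O counters (io_stats is only populated on Linux, as far as I know):

GET _nodes/stats/fs?filter_path=nodes.*.fs.io_stats.total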

One more issue I see is that Kibana visualizations also take quite some time. Our Kibana server is way underutilized. When I click on the APM dashboard in Kibana, it takes around 35 seconds for the visualization to load.
This might be a caching issue with Kibana, though: after opening the APM UI for the first time, loading a second tab doesn't take 35 seconds.

Hi @sqren,

I generated enough data to reproduce the error again.
I'm seeing a timeout in the Dev Tools for a single day's worth of logs, but I'm able to load all the logs in the Discover section easily for 2 days. Is there anything we can do to make this run faster? CPU usage is less than 40% on all 6 nodes, and some of them are below 25%. Can we do something so that the nodes use their full hardware capacity?

We have 6 nodes and each index has 4 shards; the total size of the indices for the last 2 days is 600 GB (including 1 replica).
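For reference, the shard distribution behind that number can be checked with something like this, which should show whether any shard or node is disproportionately large:

GET _cat/shards/apm-*?v&h=index,shard,prirep,store,node&s=store:desc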

FYI: Please note that we use a single "apm-YYYY.MM.DAY" index to store all of APM data like transactions, errors and spans.

  • Thank you


Here are some slow logs from when ES timed out while searching 12 hours of APM logs:

{"type": "index_search_slowlog", "timestamp": "2020-03-16T19:09:47,438Z", "level": "WARN", "component": "i.s.s.query", "cluster.name": "elasticsearch", "node.name": "elasticsearch-es-data-4", "message": "[apm-7.6.0-2020.03.16][3]", "took": "45.5s", "took_millis": "45508", "total_hits": "10000+ hits", "stats": "[]", "search_type": "QUERY_THEN_FETCH", "total_shards": "12", "source": "{\"size\":0,\"query\":{\"bool\":{\"filter\":[{\"terms\":{\"processor.event\":[\"transaction\",\"error\",\"metric\"],\"boost\":1.0}},{\"range\":{\"@timestamp\":{\"from\":1584342540477,\"to\":1584385740477,\"include_lower\":true,\"include_upper\":true,\"format\":\"epoch_millis\",\"boost\":1.0}}},{\"range\":{\"observer.version_major\":{\"from\":7,\"to\":null,\"include_lower\":true,\"include_upper\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"aggregations\":{\"host\":{\"meta\":{},\"filter\":{\"match_all\":{\"boost\":1.0}},\"aggregations\":{\"by_terms\":{\"terms\":{\"field\":\"host.hostname\",\"size\":10,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]},\"aggregations\":{\"bucket_count\":{\"cardinality\":{\"field\":\"service.name\"}}}}}},\"agentName\":{\"meta\":{},\"filter\":{\"match_all\":{\"boost\":1.0}},\"aggregations\":{\"by_terms\":{\"terms\":{\"field\":\"agent.name\",\"size\":10,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]},\"aggregations\":{\"bucket_count\":{\"cardinality\":{\"field\":\"service.name\"}}}}}}}}", "cluster.uuid": "jktKncXUQ_yiDkjUepqUCg", "node.id": "yCTfEa-nSJi5QhH-W8Praw"  }

{"type": "index_search_slowlog", "timestamp": "2020-03-16T19:09:47,424Z", "level": "WARN", "component": "i.s.s.query", "cluster.name": "elasticsearch", "node.name": "elasticsearch-es-data-4", "message": "[apm-7.6.0-2020.03.16][3]", "took": "45.4s", "took_millis": "45426", "total_hits": "10000+ hits", "stats": "[]", "search_type": "QUERY_THEN_FETCH", "total_shards": "12", "source": "{\"size\":0,\"query\":{\"bool\":{\"filter\":[{\"terms\":{\"processor.event\":[\"transaction\",\"error\",\"metric\"],\"boost\":1.0}},{\"range\":{\"@timestamp\":{\"from\":1584342540477,\"to\":1584385740477,\"include_lower\":true,\"include_upper\":true,\"format\":\"epoch_millis\",\"boost\":1.0}}},{\"range\":{\"observer.version_major\":{\"from\":7,\"to\":null,\"include_lower\":true,\"include_upper\":true,\"boost\":1.0}}}],\"adjust_pure_negative\":true,\"boost\":1.0}},\"aggregations\":{\"services\":{\"terms\":{\"field\":\"service.name\",\"size\":500,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]},\"aggregations\":{\"avg\":{\"avg\":{\"field\":\"transaction.duration.us\"}},\"agents\":{\"terms\":{\"field\":\"agent.name\",\"size\":1,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}},\"events\":{\"terms\":{\"field\":\"processor.event\",\"size\":2,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}},\"environments\":{\"terms\":{\"field\":\"service.environment\",\"size\":10,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}}}}}}", "cluster.uuid": "jktKncXUQ_yiDkjUepqUCg", "node.id": "yCTfEa-nSJi5QhH-W8Praw"  }

As you can see in the screenshots below, CPU is not spiking at all, the search queue is completely empty, and there are no rejections in the queue.


Hi,

Thanks for taking the time to try this out. I see you are getting a "504 Gateway Timeout". How long before it times out?

A couple of ideas:

  1. Can you try connecting to Elasticsearch directly, and not through the nginx proxy? Do you get a different result or a different error message?

  2. Try removing some of the aggregations in the aggs section of the query, so it looks like:

{
  "size": 0,
  "query": { ...same query },
  "aggs": {
    "services": {
      "terms": {
        "field": "service.name",
        "size": 500
      }
    }
  }
}

Does this make the query complete? If yes, try re-adding some of the aggs and find out which one causes the timeout.
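For instance, a sketch of re-adding just the avg sub-aggregation while keeping everything else the same:

{
  "size": 0,
  "query": { ...same query },
  "aggs": {
    "services": {
      "terms": {
        "field": "service.name",
        "size": 500
      },
      "aggs": {
        "avg": {
          "avg": {
            "field": "transaction.duration.us"
          }
        }
      }
    }
  }
}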

The default timeout period we have is 30 seconds.
We don't have nginx in front of Elasticsearch; our Kibana talks to Elasticsearch directly. (FYI: we do have nginx and oauth2-proxy in front of Kibana, though.)
I don't see timeouts when I remove some aggregations. Please check the screenshots below.

When I removed all aggregations, for 3 days' worth of data for a single service:


When I removed aggregations for transaction_results:

When I removed aggregations for response_times:

Can you try to do the same with the /api/apm/services endpoint?
It's a little simpler, and in your initial post it sounded like that was also timing out.

... it's on the very first page of the APM UI: the page that shows the list of services.

Hello @sqren

Could you share any benchmarking results for the APM UI done internally by the development team?

I found THIS GUIDE and THIS ISSUE, which only focus on the APM Server side.

This slowness issue occurs with only a SMALL number of requests, and it's really hard to tune Elasticsearch for the APM UI.

From my point of view, this needs to be easier for Elastic APM to be widely used.

I tried to load it 3 times:
FIRST TIME (APM UI failed):


SECOND TIME (Successfully loaded in the console):

THIRD TIME (partially loaded in the APM UI, errors are not captured in the screenshot):

FYI: Loading in the Discover section is the fastest, and it never fails.

FIRST TIME (APM UI failed):

Do you see any errors in the Kibana logs or the Chrome Dev Tools?