Endpoint 7.9 "Degraded and dashboards"

So first things first: I know it's beta. But I've run into a few oddball issues I haven't been able to stumble my way around.

I have the Endpoint installed on around 50 dev nodes now and I'm seeing a couple of common events.
Issue 1:
2020-09-03T13:50:26-07:00: type: 'STATE': sub_type: 'RUNNING' message: Application: endpoint-security--7.9.0[ee6007ad-fb21-4fe6-8c2f-7def704e2e5a]: State changed to DEGRADED:

This follows directly behind the agent failing to check in at least 2 times. Yet the Ingest Manager shows it online. This might just be a bug.

Issue 2: "Failed enrollments"
Application: endpoint-security--7.9.0[4e934a3f-4910-4e02-8476-c0361287a1b4]: State changed to FAILED: 2 errors occurred: * package 'data\downloads\endpoint-security-7.9.0-windows-x86_64.zip' not found: open data\downloads\endpoint-security-7.9.0-windows-x86_64.zip: The system cannot find the path specified. * fetching package failed: Get "https://artifacts.elastic.co/downloads/endpoint-dev/endpoint-security-7.9.0-windows-x86_64.zip": context deadline exceeded

This is a real issue for me at least. We have several dark networks due to regulations that have no access to the web, excluding piggyback connections of course. We are fully on-prem with Elastic. Can we have a local cache set up on the Elastic nodes, or at least an option to point to a relay?

Issue 3:
This is the beta part... I have data being sent to logs-*, which is the default. I do see some information from Elastic.Agent in reference to Endpoint but have no way to see malware notices. The existing 7.8 and prior dashboard defaults all reference the Endgame index. When I look at logs-* and select _index as a column option I see ".ds-logs-elastic.agent.metricbeat-default-000001", ".ds-logs-elastic.agent.metricbeat-default-000001" and ".ds-logs-elastic.agent-default-000001". These are all part of the data stream ingest (amazing read over on GitHub). How would we be able to view the current contents to see if Endgame (Endpoint) is actually working? I dropped known malware on a few test machines and have no idea what is happening. I did not see anything in the local logs to indicate whether it was detected.

One other issue. When the agent is talking to a host that goes offline for updates or some other reason, the agent in Ingest Manager stays in Offline status yet is still sending data when you check. I have noticed that if you change the Kibana instance the agent talks to, it does not always update the agents. I had 1 in 20 update to the new address.

Hi @PublicName, sorry it's been a few days without you getting a response. Hopefully this answers your questions.

  1. This is a known issue with Endpoint we're working to address. It should eventually resolve itself when it does occur.

  2. Pre-packaging Endpoint with Agent is roadmapped. In the meantime as a workaround you can manually download endpoint-security and place it in the data/downloads directory along with a *.sha512 file for it like Filebeat and Metricbeat have.

  3. While I recommend heading to the Security -> Administration page to see the status of Endpoints, you can look at the indices .ds-metrics-endpoint.*-default-000001 to see the raw status documents for Endpoint. Events that Endpoint collects are sent to the indices .ds-logs-endpoint.events.*-default-000001.
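For the workaround in item 2, here is a minimal sketch of producing the *.sha512 file for a manually downloaded package. It assumes the file follows the usual `sha512sum` layout (hex digest, two spaces, file name); the function name is made up for illustration. Compare the result against the .sha512 files already sitting next to Filebeat and Metricbeat in data/downloads before relying on it.

```python
import hashlib
from pathlib import Path


def write_sha512_sidecar(package_path):
    """Write '<package>.sha512' next to the package and return its path.

    Assumed layout (sha512sum style): '<hex digest>  <file name>' plus newline.
    """
    package = Path(package_path)
    digest = hashlib.sha512()
    with package.open("rb") as fh:
        # Hash in chunks so a large zip is never loaded into memory at once.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    sidecar = package.with_name(package.name + ".sha512")
    sidecar.write_text(f"{digest.hexdigest()}  {package.name}\n")
    return sidecar
```

On a dark-network host this would be called once per copied package, for example `write_sha512_sidecar(r"data\downloads\endpoint-security-7.9.0-windows-x86_64.zip")`.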

For the other issue, can you please clarify what you mean. Are Agent, Filebeat, Metricbeat, and Endpoint all writing documents into Elasticsearch but appear offline in Ingest Manager?

  1. Sounds good, kind of expected it already.
  2. Perfect!!! Now the fun part: how do the agents plan on receiving updates to known malware hashes and the like in a dark site? (Or a link to the roadmap.)
  3. Just needed to validate that all events are showing from agent to index to SIEM. It's still "new" and it's always interesting to see what else is in the works or not implemented yet. I know the Fortinet modules are a work in progress, for example, with only some events showing in SIEM but not all, like IPS/AV/DNS filter hits and such. The test machine that had malware on it didn't show up in the SIEM, so I need a way to make sure Endpoint is actually working.

2nd post:
After doing the update to 7.9.1 on my cluster the agents are set to use any of the 4 elastic nodes. With Kibana only being a single node and not supporting load balancing yet that may be the issue.

When Kibana came online after the update you go into Ingest/Fleet and all of the agents talking to the last elastic node that went offline will still register as offline. When you click on them you can clearly see they checked in. The offline status stayed for close to an hour at which point I stopped checking for the rest of the day. With a default check in of 1min 30 seconds I would have figured at most 20 minutes to have them all come back as online.

BTW awesome job to the Elastic Team! I can't wait to see what you have decided on for the next update.

Hi @PublicName

Thanks again for trying out the Endpoint Security beta!

I'd like to help out on this issue you listed below:

When Kibana came online after the update you go into Ingest/Fleet and all of the agents talking to the last elastic node that went offline will still register as offline. When you click on them you can clearly see they checked in. The offline status stayed for close to an hour at which point I stopped checking for the rest of the day. With a default check in of 1min 30 seconds I would have figured at most 20 minutes to have them all come back as online.

For my understanding, after you upgraded to 7.9.1, the issue is that your existing Agents stopped reporting that they were online in the UI? So in the list of Agents, all say "Offline" with the grey bubble as opposed to "Online" with the green bubble. However, when you click on the Agent and look at the Activity log, you can see the Agent communicating with Kibana through the logs reported. If this is still persisting, it may be an upgrade issue which we'll take a look at.

Are you seeing any events sent to the "logs-system*" and "metrics-system*" data streams? Is there any error in the logs related to talking to Kibana?

Not really all saying offline at the same time. It's a 2 part issue, which adds confusion. When the agent's primary Elastic ingest node goes offline, Kibana will see the agent go offline as well, which is expected for a brief moment. It should be about a 1 minute 30 second default drop until the next check in. The agent logs fail over to another node as expected, and you can see them actively by clicking on the agent in the Ingest Manager, while Kibana does not pick it up as quickly as expected and the agent shows offline.

It seems to take a fair bit of time after the original node the agent was connected to is restored for kibana to see the agent and mark it online. The entire time logs are still being shipped so it's technically not a loss of data.

If the Kibana instance is installed on a node that is also hosting Elasticsearch, even in a coordinating-only role, and that one goes offline for updates, the agent will take up to an hour to show as online again, yet the logs will still be updating.

Hope that makes sense. Logs are still being sent. Kibana does not update status quickly.

Agent to Kibana. Elastic logs are fine.

This is also not an ideal setup with agents only talking to 1 kibana instance currently. Load balancing not supported at least in the 7.9 notes.

So after looking back at a few machines that did drop offline again today, they did come online after several hours. Several agents are from the last host that went down for updates, but not all, as I have 4 nodes they can connect to. Only some of the agents have just stopped. The only thing I can see is that the 2020-09 updates were just applied to all of the offline machines, but others have 2020-09 and they work...

I started looking at the last logs as they started dropping offline in kibana and then stopped sending logs altogether today. Some as of a few minutes ago. This was not expected...

[image] This is the default config out of the box, with no changes to the system yet. Each of the offline nodes had high memory usage.

Metricbeat is bundled with Endpoint. After Elastic-Agent and Elastic-Endpoint are stopped it still runs, along with Filebeat. I had already disabled the standalone Metricbeat on the endpoints in question for testing.

One of the last logs in the ingest manager:
"malware": {
"concerned_actions": [
"status": "failure"
"streaming": {
"concerned_actions": [
"status": "success"
"status": "failure"

From the windows application log:
Faulting application name: elastic-endpoint.exe, version:, time stamp: 0x5f32bdd7
Faulting module name: elastic-endpoint.exe, version:, time stamp: 0x5f32bdd7
Exception code: 0xc0000005

From the local machine endpoint-xxx.log from the same machine the above logs and snip is from:
{"@timestamp":"2020-09-10T23:22:32.95811900Z","agent":{"id":"removed","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":1392,"name":"HttpLib.cpp"}}},"message":"HttpLib.cpp:1392 Establishing GET connection to [https://node3:9200/_cluster/health]","process":{"pid":5496,"thread":{"id":2140}}}
{"@timestamp":"2020-09-10T23:22:32.95811900Z","agent":{"id":"71cfd898-0cf9-47c5-a97d-bb8f3f3b1f9a","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"notice","origin":{"file":{"line":65,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:65 Elasticsearch connection is down","process":{"pid":5496,"thread":{"id":2140}}}

If you want more detailed logs just tell me what you need. I will PM them unredacted, as I have a decent test bed to pull from. If you want an alpha guinea pig for a beta I'll do that as well.

@pierhugues @Kevin_Logan

Agent version 7.9.1 fixed some of the disconnect issues. Not sure what changed (haven't looked). I still need to do a full restart of the cluster in an unsafe order to replicate a failure and see if it fixed the reconnect after a machine is off for an extended period.

What would be really nice is a limit on the memory that Filebeat/Metricbeat can use. I'm seeing it consume 8 GB of RAM, which is the max we have on some machines. It's reasonable for it to take up 512 MB, maybe 1 GB on a larger file, but anything past that is detrimental to the user experience.

It has forced a soft crash on a few lower-end machines I have due to it consuming everything the machine had. Any chance of putting a cap in place, or allowing a cap to be configured in the policy for a particular group?

For example, some Windows 10 1st gen embedded devices we have only have 4 GB of RAM on them. That is very tight for 10 even on a good day. Having something chunk 1 GB would hurt those devices for a good long while, as they normally have very low-end CPUs.

After additional testing of 7.9.1 I have some more concerns. Default memory usage is really high but only initially. It does drop after a few depending on file count. There is no obvious easy spot to change the settings in the ingest manager. I'm only using ingest manager for all testing not going outside of it or modifying yml files directly for testing. I have not been able to test the snapshot of 7.10 yet so forgive me if this has already been addressed.

I'm still seeing degraded messages, but now they're exclusive to Endpoint.

"malware": {
"concerned_actions": [
"status": "failure"

Essentially what Endpoint is doing in 7.9.1 is nothing at all. I dropped 40 known malware variants on a machine not more than an hour ago, knowing that it wouldn't catch any of them due to the failures indicated in the logs above.

Is there anything I can try to see if Endpoint will actually work, or should I stand by for the next update?

Hi @PublicName. Since you don't mind, can you PM me the Endpoint logs (c:\Program Files\Elastic\Endpoint\state\log*), the Endpoint's config (c:\Program Files\Elastic\Endpoint\elastic-endpoint.yaml), and the Endpoint's latest payload response from applying that config/policy (I mean the full "degraded message" you shared a snippet of). I'll look through them to see if I can determine what may be causing your failures.

As requested, you have 3 PMs due to the length exceeding the 13,000 character limit. I was unable to attach the files to the message since they aren't jpg, jpeg, png or gif.

I hope you see something I missed. I'm at a loss. The logs are repeatable on workgroup machines directly off the Microsoft ISO files for home and pro versions. From 1809 to 2004 I end up with the same message on each. It appears the driver never fully loads. The same happens on 2012R2 up to 2019 server as well. The disconnect message appears on multi clusters some being on the same layer2 network and same switch.

I do see that the driver filter you are attempting to load is signed by Elastic and Microsoft. After looking at a random sampling on my workstation and a spare laptop, I checked the drivers for vendors like AMD, Nvidia and Dell, and the only signer is Microsoft for the hardware compatibility publisher, or themselves off of a trusted root like Symantec or DigiCert (same CA you use). Not sure if that matters, as I'm grasping at straws. Really not worth me doing any debugging when you awesome folks are already well ahead of us users.

Thank you!

I looked through them and I see the issue with the Policy failure. If you go into the Security App's Administration tab and click on the "Configuration Status" for the failing host you should see a dialog pop up on the right side of the screen that lets you drill down into the policy and see the failure in a nice UI.

But, since you shared the payload document for Endpoint from Ingest Manager I'll describe how to interpret it. The relevant portion is the Endpoint.policy.applied.actions array. One of the actions contains a failure (download_user_artifacts), which means the reason your Endpoint is failing is that it cannot download artifacts it needs from Kibana (since the only artifacts a 7.9 Endpoint uses are exceptionlists, it's clear that is the artifact Endpoint cannot download).

The section you'd previously shared a snippet of was from Endpoint.policy.actions.configurations. The way to think of these two sections (actions and configurations) is that when Endpoint applies policy it does many "actions" (e.g. download user artifacts, connect to the kernel driver, etc) for the higher level "configurations" (prevent malware, collect process events, etc). The actions array lists the things Endpoint failed or succeeded in doing, the configurations portion maps those actions to the configurations they are relevant to. Hopefully that makes sense.
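To make the actions/configurations relationship concrete, here is a rough Python sketch that reports, for each configuration, which of its concerned actions failed. The sample payload is hypothetical and heavily trimmed: only download_user_artifacts, malware, streaming, concerned_actions and status come from this thread, while the "name" key and the connect_kernel action are assumptions made for illustration.

```python
def failed_actions_by_configuration(actions, configurations):
    """Map each configuration name to the failed actions it depends on."""
    failed = {a["name"] for a in actions if a.get("status") == "failure"}
    result = {}
    for config_name, config in configurations.items():
        hits = [n for n in config.get("concerned_actions", []) if n in failed]
        if hits:
            result[config_name] = hits
    return result


# Hypothetical payload; the real document has many more actions and fields.
actions = [
    {"name": "download_user_artifacts", "status": "failure"},
    {"name": "connect_kernel", "status": "success"},  # made-up action name
]
configurations = {
    "malware": {
        "concerned_actions": ["download_user_artifacts", "connect_kernel"],
        "status": "failure",
    },
    "streaming": {"concerned_actions": ["connect_kernel"], "status": "success"},
}

print(failed_actions_by_configuration(actions, configurations))
# -> {'malware': ['download_user_artifacts']}
```

Reading the payload this way points at the single failed action behind a "failure" configuration instead of leaving you to scan the whole array by eye.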

Can you look in the Endpoint logs to see why user artifacts are failing to download? The elastic-endpoint.yaml file contains information on the artifacts that are downloaded. If you search for the relative URL (/api/endpoint/artifacts/download/endpoint-exceptionlist-windows-v1) in Endpoint's logs you should hopefully see some log messages that point you to the issue. In this case since you've previously had issues with Kibana connections from Agent I suspect something similar is happening here.

I'm not sure why this failure would cause Endpoint to fail to detect the malware samples you tested. I'd be happy to work through that too but we should get your Endpoint in a good working state before diving into that.

Activity log for the agent? I don't happen to see Security Administration or Configuration Status.

Exception list is empty on all clusters as the option to add "save" exception is grayed out on each one for some reason. If I attempt to add an endpoint exemption nothing will populate. If I manually type I'm unable to save. I can add a rule exemption as a test to see if it will download and have success vs failure.

Results from URL. Considering the API is used it's a little harder to check. Using the elastic user ends with:

Going back 1 level to see if I would get a file list:
{"statusCode":404,"error":"Not Found","message":"Not Found"}

Well that would be a good reason to fail...

Is Endpoint set up like Carbon Black or Cylance, where it's only active at runtime? That would explain some of it. I did use good old Metasploit as well on an unpatched box. Guess we should start with the little things first, as you said.

I'll let you know as soon as I get the rule added and tested on a few machines. See if I end up with the same or different results.

Unable to add any exceptions as I'm unable to save so I wasn't able to test if just creating it would do the trick.

7.9.2 hasn't resolved the issue with the exemption list. It still has the random disconnects where it will say the Elasticsearch connection is down. To be honest it almost looks like a permissions issue with the API key that gets generated, as it's failing to read the cluster/health status.

Let me ask you something. Can you have username/password + API keys enabled on the same cluster? I ran into this a few weeks back, before the Ingest Manager, and was too lazy to do much past trying it a few times. I never did get both of them working at the same time.

Now we know Elasticsearch is fine with both, as we get logs from the agents. Is Kibana fine with it?

Sorry it's been a few days without a reply.

Given that it seems like you're having connectivity issues, let's work through your networking. There are two connections we need to validate for Endpoint. For both, API key authentication takes precedence over username/password authentication if both are in Endpoint's config.

Given that you hit errors in the past it's best to start with a fresh Agent and Endpoint install if possible so there is less in the Endpoint logs to go through. It would be helpful to know which connection is not working and what errors you're seeing.

Note that in the example commands below some specifics, like the API keys and URLs, will of course be different for you. Also not all of the commands below are a part of a native Windows installation. Hopefully if the exact commands don't work for you you'll be able to figure out some variant of them that works on your computer; if not just ping back and we'll find a different command together.

Connection to Kibana
Endpoint connects to Kibana to download potentially large artifacts it needs to fully apply the policy. For example, for 7.9 this is how Endpoint downloads the Alert Exceptions to apply on macOS and Windows.

In Endpoint's config (c:\Program Files\Elastic\Endpoint\elastic-endpoint.yaml) you should see a snippet that looks like this:

    access_api_key: BASE64VALUE
      host: example.com
      protocol: https
  - artifact_manifest:
          relative_url: /api/endpoint/artifacts/download/endpoint-exceptionlist-windows-v1/d801aa1fb7ddcc330a5e3173372ea6af4a3d08ec58074478e85aa5603e926658

Based on that you can search Endpoint's logs for the relative_url to see what happened when Endpoint tried to download the artifact. On my machine these are the logs I see.

C:\WINDOWS\system32>grep endpoint-exceptionlist-windows-v1 "c:\Program Files\Elastic\Endpoint\state\log\endpoint-000000.log"
{"@timestamp":"2020-09-29T21:26:10.70243100Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":2241,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:2241 Downloading artifact: endpoint-exceptionlist-windows-v1","process":{"pid":9832,"thread":{"id":2232}}}
{"@timestamp":"2020-09-29T21:26:10.70243100Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":1440,"name":"HttpLib.cpp"}}},"message":"HttpLib.cpp:1440 Establishing GET connection to [https://example.com:443/api/endpoint/artifacts/download/endpoint-exceptionlist-windows-v1/d801aa1fb7ddcc330a5e3173372ea6af4a3d08ec58074478e85aa5603e926658]","process":{"pid":9832,"thread":{"id":2232}}}
{"@timestamp":"2020-09-29T21:26:10.32287000Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":497,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:497 Artifact endpoint-exceptionlist-windows-v1 successfully verified","process":{"pid":9832,"thread":{"id":2232}}}
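Since grep is not part of a native Windows installation, the same search can be sketched in Python. This assumes every log line is a standalone JSON object with the "@timestamp", "log.level" and "message" fields seen in the output above; the function name is only illustrative.

```python
import json


def find_log_messages(lines, needle):
    """Yield (timestamp, level, message) for JSON log lines mentioning needle."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, partial or non-JSON lines
        if not isinstance(entry, dict):
            continue
        message = str(entry.get("message", ""))
        if needle in message:
            level = entry.get("log", {}).get("level", "")
            yield entry.get("@timestamp", ""), level, message


# Example use against the log path mentioned in this thread:
# path = r"c:\Program Files\Elastic\Endpoint\state\log\endpoint-000000.log"
# with open(path, encoding="utf-8") as fh:
#     for ts, level, msg in find_log_messages(fh, "endpoint-exceptionlist-windows-v1"):
#         print(ts, level, msg)
```

Printing the level next to the message makes it easier to spot the first non-info entry around a failed download.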

Further, you can use Curl to manually try to download the same artifact. Make sure to pipe the value to something like xxd since the content downloaded isn't text.

C:\WINDOWS\system32>curl -H "Authorization: ApiKey BASE64VALUE" https://example.com:443/api/endpoint/artifacts/download/endpoint-exceptionlist-windows-v1/d801aa1fb7ddcc330a5e3173372ea6af4a3d08ec58074478e85aa5603e926658 | xxd
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    22  100    22    0     0     22      0  0:00:01 --:--:--  0:00:01    41
00000000: 789c ab56 4acd 2b29 ca4c 2d56 b28a 8ead  x..VJ.+).L-V....
00000010: 0500 2719 0529                           ..'..)
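If curl and xxd are not available, a rough Python equivalent is sketched below. The hex dump above starts with 78 9c, the usual zlib header, so this assumes the artifact is zlib-compressed JSON; that is an inference from the dump, not something stated by the Endpoint documentation. The function names are made up, and fetch_artifact needs your real URL and Base64 API key.

```python
import json
import urllib.request
import zlib


def fetch_artifact(url, api_key_b64):
    """GET the artifact with the same ApiKey header the curl command uses."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"ApiKey {api_key_b64}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def decode_artifact(raw):
    """Inflate the artifact and parse it as JSON (both steps are assumptions)."""
    return json.loads(zlib.decompress(raw))


# The 22 bytes shown in the xxd output above.
dump = bytes.fromhex("789cab564acd2b29ca4c2d56b28a8ead050027190529")
print(decode_artifact(dump))
# -> {'entries': []}
```

Under that assumption the sample download decodes to an empty entries list, which is what you would expect when no exceptions have been saved yet.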


Connection to Elasticsearch
Endpoint connects to Elasticsearch to store data that it generates.

In Endpoint's config file you should see a snippet that looks like this:

    api_key: raw:value
      - https://example.com:443

Based on that you can search Endpoint's logs to see what happens when it checks to see if it can send to Elasticsearch. If after checking the cluster health it sends data to the _bulk API then it is able to send data.

C:\WINDOWS\system32>grep -A 1 "_cluster/health" "c:\Program Files\Elastic\Endpoint\state\log\endpoint-000000.log"
{"@timestamp":"2020-09-29T21:26:16.38473600Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":1440,"name":"HttpLib.cpp"}}},"message":"HttpLib.cpp:1440 Establishing GET connection to [https://example.com:443/_cluster/health]","process":{"pid":9832,"thread":{"id":9352}}}
{"@timestamp":"2020-09-29T21:26:16.45341500Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":1440,"name":"HttpLib.cpp"}}},"message":"HttpLib.cpp:1440 Establishing POST connection to [https://example.com:443/_bulk]","process":{"pid":9832,"thread":{"id":9352}}}


You can also search for "documents to Elasticsearch" to see how many documents Endpoint is periodically sending.

C:\WINDOWS\system32>grep "documents to Elasticsearch" "c:\Program Files\Elastic\Endpoint\state\log\endpoint-000000.log" | head -n 4
{"@timestamp":"2020-09-29T21:26:17.55295400Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":180,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:180 Sent 8 documents to Elasticsearch","process":{"pid":9832,"thread":{"id":9352}}}
{"@timestamp":"2020-09-29T21:28:11.49117600Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":180,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:180 Sent 1 documents to Elasticsearch","process":{"pid":9832,"thread":{"id":9352}}}
{"@timestamp":"2020-09-29T21:28:13.63456900Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":180,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:180 Sent 227 documents to Elasticsearch","process":{"pid":9832,"thread":{"id":9352}}}
{"@timestamp":"2020-09-29T21:30:11.2557800Z","agent":{"id":"4b707d92-f692-4d70-9251-fa99fa06435c","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":180,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:180 Sent 1 documents to Elasticsearch","process":{"pid":9832,"thread":{"id":9352}}}
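To get a quick feel for whether data is flowing, the "Sent N documents" lines can also be tallied with a small Python sketch. It assumes the message format shown above ("Sent <N> documents to Elasticsearch") and one JSON object per log line; the names are illustrative.

```python
import json
import re

SENT_RE = re.compile(r"Sent (\d+) documents to Elasticsearch")


def documents_sent(lines):
    """Return [(timestamp, count), ...] for each 'Sent N documents' log line."""
    results = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        match = SENT_RE.search(str(entry.get("message", "")))
        if match:
            results.append((entry.get("@timestamp", ""), int(match.group(1))))
    return results


# Example use:
# path = r"c:\Program Files\Elastic\Endpoint\state\log\endpoint-000000.log"
# with open(path, encoding="utf-8") as fh:
#     batches = documents_sent(fh)
# print(len(batches), "batches,", sum(n for _, n in batches), "documents")
```

A long gap between timestamps, or a log with no such lines at all, would line up with the "Elasticsearch connection is down" notices.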


From the configuration file snippet you can also generate a Curl request to see what happens when you manually try to connect to Elasticsearch. Notice that before using Curl you must base 64 encode the api_key value.

C:\WINDOWS\system32>python3 -c "import base64; print(base64.b64encode('raw:value'.encode('utf-8')).decode('ascii'))"

C:\WINDOWS\system32>curl -H "Authorization: ApiKey cmF3OnZhbHVl" https://example.com:443/_cluster/health
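The encode-then-request steps can also be done in one Python sketch. The function names are made up; the header format matches the curl command above. Note that urllib verifies certificates against the system trust store, so a cluster using an internal CA would need an ssl context passed to urlopen, which is left out here.

```python
import base64
import json
import urllib.request


def api_key_header(raw_api_key):
    """Build the Authorization header value from the raw api_key config value."""
    encoded = base64.b64encode(raw_api_key.encode("utf-8")).decode("ascii")
    return f"ApiKey {encoded}"


def cluster_health(base_url, raw_api_key):
    """GET <base_url>/_cluster/health using API key authentication."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/health",
        headers={"Authorization": api_key_header(raw_api_key)},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


print(api_key_header("raw:value"))
# -> ApiKey cmF3OnZhbHVl
```

A 401 or 403 from cluster_health would suggest a problem with the key or its permissions, while a timeout or connection error points at the network path instead.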

I will fire up a test VM straight off the ISO and see what I get as soon as I can get to it; it might be a few days. The errors are from several machines, try 50+ test machines. None of them are cloned; all are direct Windows WIM installs. I get valid, usable data from metrics and from Filebeat from all agents, even the ones that say failed. It's not an error I can reproduce at will, so it's hard to track down. What I don't get, and several other people on the forums are not getting, are endpoint malware events. They are never sent to Elasticsearch, even with the 7.9.2 agent. I haven't been able to test the snapshot version yet.

The failed connection in my use case lines up with the degraded messages that show up on the kibana fleet manager part.

For the Malware alert issue, have you tried testing with a version of Mimikatz? Endpoint detects malware when it is written or executed but not if it is just sitting on the filesystem.

Can you go to Security->Administration and make sure the Policy for your Endpoint is in a green/success state? If it isn't, you can click on the status and a dialog will appear on the right showing what worked and didn't work. Please share what isn't working.

Assuming malware detection is working, can you see if running or copying Mimikatz on the C:\ drive generates an alert?