OSQuery Live Queries don't go through

Hey,

I'm using a 7.13 stack with 7.13 Elastic Agents on multiple machines. I have the Osquery Manager integration installed (v0.2.3) and have applied the corresponding policy to three Agents (2x Ubuntu 20.04, 1x Windows Server 2019).

When I run live queries, they either don't reach the targets or don't come back to Kibana (they stay stuck on "pending"). I'm not sure what's happening and would appreciate some pointers on where to start debugging.

On one Linux host, queries sometimes work; on the other two hosts, they have never come back successfully. I'm using the simple query "SELECT * FROM os_version;".
I have confirmed that osqueryd and osquerybeat are running on the hosts. The hosts also appear "green" in the live query host selector.

Some log files from the host that does work occasionally:

<install_path>/data/elastic-agent-054e22/install/osquerybeat-7.13.0-linux-x86_64/osquery/osqueryd.INFO

I0527 17:27:18.444630 702059 eventfactory.cpp:156] Event publisher not enabled: BPFEventPublisher: Publisher disabled via configuration
I0527 17:27:18.444713 702059 eventfactory.cpp:156] Event publisher not enabled: auditeventpublisher: Publisher disabled via configuration
I0527 17:27:18.444742 702059 eventfactory.cpp:156] Event publisher not enabled: inotify: Publisher disabled via configuration
I0527 17:27:18.444759 702059 eventfactory.cpp:156] Event publisher not enabled: syslog: Publisher disabled via configuration

There's nothing else in this log file.

<install_path>/data/elastic-agent-054e22/install/osquerybeat-7.13.0-linux-x86_64/osquery/osqueryd.results.log is empty on every host.

Thanks @nemhods for raising the issue, and sorry for the unexpected experience with Osquery Manager.

Could you try running "SELECT * FROM users;" to see if anything comes back from that?

@nemhods thank you for giving our first beta a try.

There are a couple of files that can give us more information on what's going on: the elastic-agent log and the osquerybeat log. They should be in the data directory, something like:
data/elastic-agent-054e22/logs/elastic-agent-json.log-*
and
/data/elastic-agent-054e22/logs/default/osquerybeat-json.log

You can also change the agent logging level to debug to get more details. To do that, click on the specific agent on the Fleet page, choose the "Logs" tab, and at the bottom of the tab, under "Agent logging level", select "debug" and apply the changes.
This will update agent.logging.level in the fleet.yml file and set the logging level to "debug".
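
For reference, once the change is applied, the corresponding entry in fleet.yml should end up looking roughly like this (nested form shown; the surrounding keys in your file may differ):

agent:
  logging:
    level: debug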

Hey,

I have a few updates, though nothing really stands out. I also uncovered more issues; maybe something more fundamental is wrong with my install. I will mention all of them, because for one of you the puzzle pieces might add up to a whole picture. Apologies if this makes the issue more complex...

  1. First of all, thanks for pointing me to the logging level setting for agents - I did not know this existed. Is there a way to set this globally on an Agent Policy?! Also, when I close the page and re-open it, the logging setting goes back to "info" - is this only a local filter, or does it actually instruct the agent to change its log level for all integrations?
  2. The osquerybeat logs on the machines themselves are empty (apart from the regular "Non-zero metrics" log lines). cat data/elastic-agent-054e22/logs/default/osquerybeat-json.log* | grep -v "Non-zero" returns nothing.
  3. I'm seeing weird behavior with the fleet server. When I update a policy, it shows the hosts as "out of date", even though they appear "green" and check in regularly. I had to restart the fleet server for the changes to go through. Are the Osquery queries relayed through the fleet server? I'm getting the feeling that the fleet server is the culprit here. The fleet server does not produce any log lines besides this startup sequence:
fleet_server_1  | 2021-05-29T14:44:56.697Z      INFO    warn/warn.go:18 The Elastic Agent is currently in BETA and should not be used in production
fleet_server_1  | 2021-05-29T14:44:56.697Z      INFO    application/application.go:68   Detecting execution mode
fleet_server_1  | 2021-05-29T14:44:56.698Z      INFO    application/application.go:93   Agent is managed by Fleet
fleet_server_1  | 2021-05-29T14:44:56.698Z      INFO    capabilities/capabilities.go:59 capabilities file not found in /usr/share/elastic-agent/state/capabilities.yml
fleet_server_1  | 2021-05-29T14:44:56.949Z      INFO    [composable]    composable/controller.go:46     EXPERIMENTAL - Inputs with variables are currently experimental and should not be used in production
fleet_server_1  | 2021-05-29T14:44:57.052Z      INFO    [composable.providers.docker]   docker/docker.go:43     Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
fleet_server_1  | 2021-05-29T14:44:57.054Z      INFO    store/state_store.go:327        restoring current policy from disk
fleet_server_1  | 2021-05-29T14:44:57.060Z      INFO    stateresolver/stateresolver.go:48       New State ID is hH81eU-q
fleet_server_1  | 2021-05-29T14:44:57.060Z      INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 2 step(s)
fleet_server_1  | 2021-05-29T14:44:57.100Z      INFO    operation/operator.go:259       operation 'operation-install' skipped for fleet-server.7.13.0
fleet_server_1  | 2021-05-29T14:44:57.306Z      INFO    log/reporter.go:40      2021-05-29T14:44:57Z - message: Application: fleet-server--7.13.0[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
fleet_server_1  | 2021-05-29T14:44:57.486Z      INFO    operation/operator.go:259       operation 'operation-install' skipped for filebeat.7.13.0
fleet_server_1  | 2021-05-29T14:44:57.763Z      INFO    log/reporter.go:40      2021-05-29T14:44:57Z - message: Application: filebeat--7.13.0--36643631373035623733363936343635[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
fleet_server_1  | 2021-05-29T14:44:58.018Z      INFO    operation/operator.go:259       operation 'operation-install' skipped for metricbeat.7.13.0
fleet_server_1  | 2021-05-29T14:44:58.335Z      INFO    log/reporter.go:40      2021-05-29T14:44:58Z - message: Application: fleet-server--7.13.0[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to RUNNING: Running on default policy with Fleet Server integration - type: 'STATE' - sub_type: 'RUNNING'
fleet_server_1  | 2021-05-29T14:44:58.820Z      INFO    log/reporter.go:40      2021-05-29T14:44:58Z - message: Application: metricbeat--7.13.0--36643631373035623733363936343635[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
fleet_server_1  | 2021-05-29T14:44:58.822Z      INFO    stateresolver/stateresolver.go:66       Updating internal state
fleet_server_1  | 2021-05-29T14:44:58.822Z      WARN    [tls]   tlscommon/tls_config.go:98      SSL/TLS verifications disabled.
fleet_server_1  | 2021-05-29T14:44:58.827Z      INFO    stateresolver/stateresolver.go:48       New State ID is hH81eU-q
fleet_server_1  | 2021-05-29T14:44:58.827Z      INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 0 step(s)
fleet_server_1  | 2021-05-29T14:44:58.827Z      INFO    stateresolver/stateresolver.go:66       Updating internal state
fleet_server_1  | 2021-05-29T14:44:58.931Z      INFO    log/reporter.go:40      2021-05-29T14:44:58Z - message: Application: filebeat--7.13.0--36643631373035623733363936343635[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
fleet_server_1  | 2021-05-29T14:44:59.156Z      INFO    [api]   api/server.go:62        Starting stats endpoint
fleet_server_1  | 2021-05-29T14:44:59.157Z      INFO    application/managed_mode.go:291 Agent is starting
fleet_server_1  | 2021-05-29T14:44:59.157Z      INFO    [api]   api/server.go:64        Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)
fleet_server_1  | 2021-05-29T14:44:59.257Z      WARN    application/managed_mode.go:304 failed to ack update open /usr/share/elastic-agent/state/data/.update-marker: no such file or directory
fleet_server_1  | 2021-05-29T14:45:00.209Z      INFO    log/reporter.go:40      2021-05-29T14:45:00Z - message: Application: metricbeat--7.13.0--36643631373035623733363936343635[e03065cf-b804-4c2a-85aa-0c034ff3d6ed]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'

Edit: the Agents also continuously log:
[elastic_agent][info] No events received within 10m0s, restarting watch call

Osquery queries are routed through the fleet-server.

Can you attach the osquerybeat-json.log files for all three hosts?

Thanks!

Hey,

I'm pretty sure now that the culprit is not osquerybeat. I ran a live query that didn't work, and once I restarted the fleet server, it suddenly worked once (on all hosts!). The next query then doesn't go through again. It's very inconsistent.

I've now found the fleet-server logs, and they spit out some errors:

{"log.level":"info","index":".fleet-policies","ctx":"index monitor","error.message":"elastic fail 400:status_exception:wait_for_advance only supports indices with one shard. [shard count: 2]","@timestamp":"2021-06-13T16:38:14.062Z","message":"failed on waiting for global checkpoints advance"}
{"log.level":"info","index":".fleet-actions","ctx":"index monitor","error.message":"elastic fail 400:status_exception:wait_for_advance only supports indices with one shard. [shard count: 2]","@timestamp":"2021-06-13T16:38:14.423Z","message":"failed on waiting for global checkpoints advance"}

I'm overriding all templates to use 2 shards (I'm using a 2-node setup) like this:

{
  "index_patterns": [
    "*"
  ],
  "order": 10,
  "settings": {
    "number_of_shards": "2",
    "number_of_replicas": "1"
  }
}
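
To double-check that the fleet indices really did pick up two primary shards from that template, something like this (via the _cat shards API) should show two primaries per index:

GET _cat/shards/.fleet-actions*,.fleet-policies*?v&h=index,shard,prirep,state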

Could this be the root of this problem?

Edit: Sorry, you asked for the osquerybeat.log files - when the issue occurs (I start a live query and it doesn't go through), there is nothing in the logs, just repeating "monitoring" lines.
I restarted the beat to get you the startup sequence; here it is:
osquerybeat-json.log

I've found GitHub issues related to custom templates overriding system templates and possibly damaging the cluster, but there doesn't seem to be a solution there, as those discussions are older. Maybe there's a new feature that would protect the system indices .fleet-policies and .fleet-actions from such tampering?

I think for now, I will work around this by changing my catch-all template to match the patterns

"a*",
"b*",
"c*",
...

.... in order to avoid matching anything starting with a dot. It's quite ugly though - has anyone got a better idea? Also, how would I apply the new number of shards once the underlying index template is changed? I can't roll over, because that throws:

POST /.fleet-actions/_rollover
{
  "conditions": {
    "max_docs":  1
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "index name [.fleet-actions] does not match pattern '^.*-\\d+$'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "index name [.fleet-actions] does not match pattern '^.*-\\d+$'"
  },
  "status" : 400
}

@nemhods I don't think we provide any good way yet to change the settings of the system indices. Could you share a bit more background on why you want to change the defaults?

I actually didn't want to change the defaults of the system indices.
I only wanted to set a cluster-wide default shard count of 2 for data indices (the default of 5 seemed too much for my 2-node cluster). The only way I knew how was to define a template that matches indices like "*" - which apparently includes system indices.

I didn't know that this could cause problems for system indices.

Is there a way to "repair" system indices with too many shards? Rolling indices over doesn't help with changing the shard count, right?

I think I now better understand what is happening here. You set a "global" template with higher priority, which basically overrides any other template. Did you use an "old" template or an index/component template? Could you share the exact API call you used?

In general I agree, even if you set a * match pattern, system indices should likely not be affected by this. I'll test this on my end and follow up with an issue if needed.

Assuming you use component templates, there is another issue here. Component templates do not support inheritance. So if you have a global template that gets picked, all the mappings and everything else in the other templates is ignored.

That's exactly right. It's a legacy template; I used something along the lines of:

PUT _template/my_default_template
{
  "index_patterns": [
    "*"
  ],
  "order": 10,
  "settings": {
    "number_of_shards": "2",
    "number_of_replicas": "1"
  }
}
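
For completeness, swapping it for the narrowed version I described earlier would just be a PUT to the same template name (a sketch only, patterns abbreviated):

PUT _template/my_default_template
{
  "index_patterns": ["a*", "b*", "c*"],
  "order": 10,
  "settings": {
    "number_of_shards": "2",
    "number_of_replicas": "1"
  }
}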

Right now, the only way of repairing this that I can think of is to take Kibana offline and do a reindex, or use the _shrink API to reduce the number of primary shards for the affected indices .fleet-actions and .fleet-policies, now that they're no longer affected by my custom template.

Is it plausible that this could prevent the Fleet Server from working properly, and especially from relaying Osquery queries? Is the above-mentioned "failed on waiting for global checkpoints advance" a critical error for the fleet server?

Okay, so I performed the following surgery:

# 1. Make the original two-shard index read-only
PUT .fleet-policies-7/_settings
{
  "index": {
    "blocks.read_only": true
  }
}

# 2. Shrink it into a temporary single-shard index, clearing the
#    allocation requirement and write block on the target
POST .fleet-policies-7/_shrink/.fleet-policies-smol
{
  "settings": {
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

# 3. Lift the read-only block again
PUT .fleet-policies-7/_settings
{
  "index": {
    "blocks.read_only": false
  }
}

# 4. Remove the original index
DELETE .fleet-policies-7

# 5. Clone the shrunk index back under the original name
POST .fleet-policies-smol/_clone/.fleet-policies-7

# 6. Restore the .fleet-policies alias, pointing at the rebuilt index
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".fleet-policies-7",
        "alias": ".fleet-policies",
        "is_write_index": true
      }
    }
  ]
}

(again for the .fleet-actions index.)
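
To double-check that the alias ended up on the rebuilt index (it should list .fleet-policies-7 again):

GET _alias/.fleet-policies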

After that, I'm getting the following on the fleet server:
{"log.level":"info","index":".fleet-policies","ctx":"index monitor","error.message":"elastic fail 400:status_exception:number of checkpoints must equal number of shards. [shard count: 1, checkpoint count: 2]","@timestamp":"2021-06-16T20:26:07.228Z","message":"failed on waiting for global checkpoints advance"}
... but a fleet server restart solves this.

I can now perform live queries, and the results come back almost immediately. It's like magic!

I will try to weigh in on the GitHub issues that I found with my story here, so that we may get some system-index protection from rogue templates in the future.

Hi, @nemhods. I'm an Elasticsearch developer working on system indices, and I saw the comment you posted on GitHub. We will either respond to your comment there or create a new GitHub issue for this, but I wanted to respond a little more informally here first.

First off, that was some impressive detective work and a good fix that you found. I'm sorry that you hit this issue in the first place, but I'm glad you got to the bottom of it.

The issue you ran into is that user-defined templates with broad wildcards can inadvertently affect the indexes that Elastic components are using and managing for their own purposes. We've got a few different features that are meant to avoid this danger.

First, there's the ongoing development of System Indices. When the system index feature is fully implemented, index templates won't be able to affect any system indices. But we're still developing this feature in stages, so unfortunately we're not quite there yet.

Second, composable index templates are designed to avoid some of these problems. For example, you can set priorities on your templates to avoid clashing with system templates. The bad news is that I'm not sure you can do a pure wildcard in a composable template; you might need to do the "a*" to "z*" patterns you described earlier. I'm not totally sure about this one.
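
For what it's worth, a rough sketch of that approach with a composable template would look something like the following - the template name, patterns, and priority here are just placeholders, and I haven't tested how broader patterns interact with the built-in templates:

PUT _index_template/default_two_shards
{
  "index_patterns": ["a*", "b*", "c*"],
  "priority": 1,
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}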

Third, data streams are a more powerful but more complicated way of letting Elasticsearch manage indices and settings. (They might not be right for this use case, though!)

All that being said, that was some pretty cryptic and nasty behavior that you ran into, and we really don't want other Fleet users to have the same experience you did. I'll see if there's a way that we can do anything about rogue templates and these fleet indices in particular, and at the very least I want to make sure your solution has good visibility for other users. Thanks for your patience in working through it, and keep an eye on Github for another followup.

-William

Thanks @William_Brafford for chiming in. Can you reference the GitHub issue here where the conversation about protecting system indices happens?

@nemhods Great that you got it working. Multiple shards can indeed have an effect on fleet-server actions. This is something we also plan to improve in the future, to make sure it works better with multiple shards.

Thanks for all your work on this one.

@ruflin I've created a GitHub issue for a possible short-term fix here: Legacy index templates should not apply to "net new" system indices · Issue #74271 · elastic/elasticsearch · GitHub

I'll update that issue when I know whether or not it will apply to the Fleet issue specifically.
