How to get data into the beats specific index via logstash

The use case here is that we have:

*beats -> logstash -> elasticsearch cloud

The following requirements are in place:

  • The hosts running the beats do not have direct internet access and can only communicate via logstash.
  • Logstash must be used (it's the easiest to work with for data enrichment) since there are some significant data enrichment processes in use using grok to parse and extract key pieces of information from the logs (as well as geoip data and a whole host more).
  • Either use of data streams for all data coming from the beats correctly split by environment (namespace), dataset, and type; or write material to the correct beats specific default created indexes using ILM.

Currently we're on 7.15 for the beats and logstash.

Everything I've found related to the above fails one or more of our requirements.
The whole goal is to use the defaults (ILM, Dashboards, etc) provided as much as possible and the simplest non-API setup to ensure it is easily reproducible. (We'd prefer to use Infrastructure as Code but that's for later, need something working ASAP).

It's not clear what problems you are having with this approach?

To clarify for more information:

The *beats themselves don't add any of the data_stream fields, which means you need to work out the right values yourself (is an event of type logs or metrics? This is particularly unclear for the auditbeat data, where some events are clearly log entries but others, like socket, are not). Is there a clear, concise way of marking these events with the right data_stream types? Is this even the right approach?

The ILM index pattern used by the *beats when writing directly to an Elasticsearch endpoint has a clear delineation of the index, i.e. metricbeat-*, etc. However, when the material passes through logstash and you want to write it to Elasticsearch via the output, there is no clear way of reflecting the original index destination that you would get when the beat writes to Elasticsearch itself.

If I have a single logstash instance which receives all the material from all the *beats, how do I go about splitting that out so that each goes through to the right destination? Stuff that's a data stream goes to the right place, metricbeat material goes to the right metricbeat index so that the default metricbeat dashboards can utilise it after having been enriched by logstash, and similarly for filebeat and auditbeat.

Not sure about data streams, but events from the beats input have [@metadata] fields which you can use to write to the correct beats index pattern and apply ILM etc.

This requires previously configured templates, ILM policies, and index patterns, plus the applicable modules, all set up through the beats setup procedure.
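
To check which [@metadata] fields actually arrive from the beats input, a temporary debug output can help. This is a minimal sketch: the stdout output is an addition for inspection only and is not part of the setup described in this thread. The rubydebug codec hides @metadata unless metadata => true is set.

  output {
    stdout {
      # print each event including the normally hidden [@metadata] fields
      codec => rubydebug { metadata => true }
    }
  }

For events coming from a beat this should show fields such as [@metadata][beat] and [@metadata][version], which are the ones the index pattern can be built from.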

Quite simple, nothing about it seems to work consistently across all pieces to satisfy all requirements.
The only one that's been simple enough has been syslog, and that was just by guessing some "suitable" data_stream values that hopefully work with any default dashboards from filebeat (I've not imported them yet).

An input like this

  # Handle RFC3164 syslog messages on both TCP and UDP (these are handled by
  # default with the input parser)
  syslog {
    id => "syslog-rfc3164"
    type => "syslog"
    port => 1514
    timezone => "UTC"
    ecs_compatibility => "v1"

    # NB: you cannot use comments inside the add_field section, will cause a parse error
    # data stream fields part of ECS identify the application (dataset),
    # namespace (environment), and type (logs or metrics)
    # syslog line add's the format to the syslog specific section
    add_field => {
      "[labels][environment]" => "staging"
      "[data_stream][namespace]" => "staging"
      "[data_stream][type]" => "logs"
      "[data_stream][dataset]" => "syslog"
    }
    tags => ["staging", "syslog", "rfc3164"]
  }

combined with an output like this

  elasticsearch {
    id => "output-to-cloud-elastic"
    cloud_auth => "<redacted>"
    cloud_id => "<redacted>"
    data_stream => "true"
    data_stream_auto_routing => "true"
    data_stream_type => "logs"
    data_stream_dataset => "syslog"
    data_stream_namespace => "staging"
    ecs_compatibility => "v1"
  }

Data appears to be arriving in the Elastic cloud instance and going to the logs-syslog-staging data stream. So far so good.

Next I'd like to get the output from metricbeat into this, we start with a simple box for now

metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

# ------------------------------ Logstash Output -------------------------------
output.logstash:
  enabled: true
  # The Logstash hosts
  hosts: ["logstash:15044"]

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

For the logstash input we have:

input {
  beats {
    port => 15044
    id => "metrics"
    type => "beats"
    ecs_compatibility => "v1"
    add_field => {
      "[labels][environment]" => "staging"
    }
    tags => ["staging", "beats", "metrics"]
  }
}

then some filter data enrichment (geoip stuff)
then when we want to send this data to Elasticsearch, we have the following problems:

  • The data_stream approach
    • There is no data_stream information coming out of metricbeat itself, so we'd have to add it ourselves based upon the event.dataset.
    • However, looking at a console output of metricbeat, not everything is in fact a metric; there's the occasional log message in there from the beat.state module. So we'd need some way of distinguishing between them.
    • Now we need to get the data into an index that matches the pattern used by the Metricbeat created Dashboards, which expect things to match the metricbeat-* index filter.
    • And then we finally want to ensure some form of ILM is used to ensure that data/logs are not kept for longer than X days or Y size.
    • Ideally the generated data streams match those of the output of other tools in the future (or current) like that of the ElasticAgent.
  • Default ILM based index approach
    • In metricbeat if you configure an output direct to Elasticsearch, with ILM enabled we get

      When index lifecycle management (ILM) is enabled, the default index is "metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}-%{index_num}" , for example, "metricbeat-7.15.0-2021-10-07-000001" . Custom index settings are ignored when ILM is enabled.

    • now when we use ILM in logstash, the index created seems to be (due to the use of ecs_compatibility) in the form of "ecs-logstash-%{+yyyy.MM.dd}". This is not the index format expected and used by the dashboards so we'd somehow need to change it to an appropriate index with ILM settings that match those expected by the dashboards and match the metricbeat form of "metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}-%{index_num}".
    • For performance and resource usage and to minimise the number of connections to logstash, this would need to handle not only metricbeat, but also auditbeat, and filebeat (at a minimum) each with their own index form requirements.
  • The ElasticAgent approach
    • Doesn't work since there's kubernetes clusters involved that need scraping and nothing goes on the clusters that doesn't have a Helm chart for their deployment.
    • Doesn't work because the majority of endpoints do not have direct Internet access and as such cannot communicate with the Fleet Server on the cloud Elastic deployment directly.
    • The Standalone configuration could be an option by sending data to Logstash with an elastic_agent input. However other than the basics of installing a standalone instance, there is no documentation on configuring the standalone elastic agent to monitor specific log files, metrics, processes, scrape metrics from prometheus exporters running on the same host, etc.
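
For the data_stream approach listed above, one possible way of filling in the missing fields is a mutate filter that copies event.dataset into the data_stream fields. This is only a sketch under assumptions: it treats every metricbeat event as type metrics (which, as noted in the list, is not strictly true), it assumes [labels][environment] was already set on the input, and the resulting data streams would not be covered by the metricbeat-* dashboards.

  filter {
    if [@metadata][beat] == "metricbeat" and [event][dataset] {
      mutate {
        # these are the fields the elasticsearch output uses for data stream auto routing
        add_field => {
          "[data_stream][type]" => "metrics"
          "[data_stream][dataset]" => "%{[event][dataset]}"
          "[data_stream][namespace]" => "%{[labels][environment]}"
        }
      }
    }
  }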

I'd be happy if any of these options somehow worked or could be made to work with minimal effort.
There's probably something I'm missing here with regards to the life cycle material, but there's a couple of simple rules we go by when evaluating new products. (it's a small team and there just isn't enough time to spend a month on something)

  • Can a Proof of Concept be done in <5 days
  • Can a production ready secure deployment be done in <3 days starting from nothing using the majority of defaults and recommended settings (a few extra days available if it can be done completely via Infrastructure as Code).

Regarding that index pattern and using the [@metadata] fields, we get that lovely line you're referencing and a bit above that we have:

If ILM is not being used, set index to %{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd} instead so Logstash creates an index per day, based on the @timestamp value of the events coming from Beats.

Of course we want ILM to be used, logs and metrics need to be rotated and archived/discarded after a fixed time period. This section doesn't clearly identify what needs to be done to get the material into the correct index with appropriate ILM attached.

The section assumes that you have previously configured beats to work with Elasticsearch through the use of beats setup -e methods.

For example, if you want to use filebeat's provided template and ILM:

  1. Install filebeat on a machine that has access to your ES instance. You can do this on your logstash instance.
  2. Configure the filebeat output to point at your ES instance (this is required to set up the index templates, ILM policies, and pipelines).
  3. Run filebeat setup -e. This will configure ES with the filebeat templates and set up ILM. If you need dashboards to be set up as well, be sure to configure the required kibana settings. More detailed documentation here.
  4. Once you confirm the index templates and ILM policies are configured, you can configure your filebeat to send output to logstash, and in logstash you can do
output {
  elasticsearch {
    # connection settings for your ES instance go here as well
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}

That will allow you to write to each beat's specific index pattern, using the index template and policies specific to that beat.

Awesome, thanks. I think I can make it work from here with that (I hope). It's been frustrating with pieces of information all over the place, and the last time I looked at this was a year ago.

All good, kind of.

The next error I'm getting is that the 2 indices created now display an Index lifecycle error.
An example is:

Index lifecycle error
illegal_argument_exception: index.lifecycle.rollover_alias [auditbeat-7.15.0] does not point to index [auditbeat-7.15.0-staging-2021.10.12]

I've stuck the environment in the index pattern to make it easier to distinguish things, and since the index patterns seem to be either <beat>-* or <beat>-<version>-* this shouldn't be causing the issue.

Prior to loading any data into the system I'd gone through and used the beat setup -e to create the Legacy index templates, ILM policies, and dashboards. Guessing something else isn't quite right somewhere?

The output being used looks like the below

if [@metadata][beat] {
    elasticsearch {
      id => "output-to-cloud-elastic-as-metricbeat"
      cloud_auth => "<redacted>"
      cloud_id => "<redacted>"
      ecs_compatibility => "v1"
      data_stream => "false"
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{[labels][environment]}-%{+YYYY.MM.dd}"
    }
  }

Any suggestion on what and where to go to fix these?

Think I've solved those as well by explicitly setting the ilm_policy in the Elasticsearch output.

Nope, it's not solved, still getting errors.

Or whatever these mean
Screenshot_20211013_022624

Interestingly we seem to have 2 filebeat indices
Screenshot_20211013_023002
And that second one, with the 000001, does have an alias defined for filebeat-7.15.0; it looks like that was the one created using the beat setup -e for the relevant beats.

How do we rectify this mismatch so that the correct one is getting populated with data and has the right lifecycle policy and alias configured?

the current output Elasticsearch config looks like

      ecs_compatibility => "v1"
      data_stream => "false"      
      ilm_enabled => "true"
      ilm_policy => "%{[@metadata][beat]}"
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{[labels][environment]}-%{+YYYY.MM.dd}"

You will probably get a better answer in the Elasticsearch forum for that particular issue.

My guess is that adding %{[labels][environment]} to the index name breaks ILM for that index. The default filebeat index template will create the filebeat index with the alias filebeat-7.15.0. The ILM policies defined by beats setup will only work for the default index template and default index name, and modifying the index name will require modifications to the index template, ILM, etc.

For your use case, I think you will need to create a separate index template with a separate alias for each type of logs. You can only have one write index for one alias, so filebeat-7.15.0-staging-* will need a different alias from filebeat-7.15.0-prod-*, for example.

If ilm_enabled is set then the index option is ignored.

ilm_policy is used during initialization, at which point no events exist, so it cannot use a sprintf reference.
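
Given those two constraints, one option is a separate output per beat with static ILM values. This is a sketch only: it assumes the filebeat-7.15.0 alias and a policy named filebeat already exist from running filebeat setup -e, and the hard-coded version would need updating whenever the beats are upgraded.

  output {
    if [@metadata][beat] == "filebeat" {
      elasticsearch {
        cloud_auth => "<redacted>"
        cloud_id => "<redacted>"
        ilm_enabled => true
        # static values, since these options are read when the output starts
        ilm_rollover_alias => "filebeat-7.15.0"
        ilm_pattern => "{now/d}-000001"
        ilm_policy => "filebeat"
        # keep the template installed by filebeat setup
        manage_template => false
      }
    }
  }

The same block would be repeated for metricbeat and auditbeat with their own alias and policy names.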


@Badger that seems to be true for Elasticsearch directly in the beats, but this is inside logstash where this functionality seems to work quite well.

I don't use beats, so I cannot speak to that, but what I said is true, and documented for the logstash Elasticsearch output.

Then how the heck is this working.. I'm getting material written into the index specified by that pattern quite happily. So the index IS being utilised. There are no beats of any kind writing direct to Elasticsearch, everything goes via logstash.


The ones with the 000001 are the ones created by manually running the setup -e for each of the beats in question on the ES cluster itself. Which is only done to ensure the dashboards and policies are loaded.

Then logstash, with the settings below (I've removed my labels/environment as part of this testing), is generating those beats specific data events. So if, as you say, the ilm_enabled setting overrides the index, then how come my events are written to the index as specified by my config?

if [@metadata][beat] {
    elasticsearch {
      id => "output-to-cloud-elastic-as-metricbeat"
      cloud_auth => "<redacted>"
      cloud_id =>  "<redacted>"
      ecs_compatibility => "v1"
      data_stream => "false"      
      ilm_enabled => "true"
      ilm_policy => "%{[@metadata][beat]}"
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
  } else {
    # anything that's not from a beat goes to the default elasticsearch output
    elasticsearch {
      id => "output-to-cloud-elastic-failover"
      # TODO: these need to be secret
      cloud_auth => "<redacted>"
      cloud_id =>  "<redacted>"
      ecs_compatibility => "v1"
      data_stream => "false"
      ilm_enabled => "true"
      ilm_policy => "logs"
      ilm_rollover_alias => "ecs-logstash"
    }
  }

Either way, that last Elasticsearch output never gets hit by the incoming data, and the only problem with the first Elasticsearch output at the moment is that the alias isn't getting applied (or the docs aren't written to the relevant index, however you want to look at it). The logstash generated indices just seem to be missing the aliases value(s).

And that error is a pain in the butt since it's not showing up all the time, reloading the page makes it go away sometimes, and they come back when you log out and back in again.

Are you sure there are not beat agents with the template set to overwrite somewhere in your environment?

Yes, there's only one set of beats running on one host at this stage, and all configured to write to logstash.
auditbeat
metricbeat
filebeat

And I know this is valid, because the host they are running on does not have direct internet connectivity, the Elasticsearch being used is the cloud variant, and the only thing configured with the creds to write to it is the logstash instance.

For the ILM to work, the templates need to end in a number that needs to be able to increment. So maybe the ILM setup is clashing with the logstash setup. Also in the beats agents yml is the index set to default (legacy) instead if index>

Here is my working setup.

output
{
if [@metadata][pipeline]
  {
    elasticsearch
    {
      hosts    => ['https://elasticsearch.mydomain.com:443']
      data_stream => "false"
      ilm_enabled => "true"
      ilm_pattern => "{now/d}-000001"
      ilm_policy => "%{[@metadata][beat]}"
      ilm_rollover_alias => "%{[@metadata][beat]}-%{[@metadata][version]}"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      ssl => true
      ssl_certificate_verification => true
      ecs_compatibility => "v1"
      pipeline => "%{[@metadata][pipeline]}" 
    }
  }
else 
  {
    elasticsearch
    {
      hosts    => ['https://elasticsearch.mydomain.com:443']
      data_stream => "false"
      ilm_enabled => "true"
      ilm_pattern => "{now/d}-000001"
      ilm_policy => "%{[@metadata][beat]}"
      ilm_rollover_alias => "%{[@metadata][beat]}-%{[@metadata][version]}"
      manage_template => false
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
      ssl => true
      ssl_certificate_verification => true
      ecs_compatibility => "v1"
    }
  }
}

Awesome, that worked, no more errors. I guess what threw me on the rollover_alias was this line

ilm_rollover_alias does NOT support dynamic variable substitution as index does.
