Automatic Index Creation

Hi Team,

We have a requirement to create indices automatically by reading the content of CSV files. Each CSV file has a field named "timestamp" that carries the time in epoch format. The objective is to create a day-wise index based on that timestamp field. In our environment, we use Filebeat to ingest data into these indices.

Kindly provide guidance on how to accomplish this task.

Thanks,
Debasis

Hi @Debasis_Mallick,

Filebeat should be able to pick up the files if you specify a glob pattern matching the path of the files. There are examples in the input documentation for log files that can be adapted.
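
As a rough illustration of the input side, a minimal sketch might look like the following. The input ID and the directory are placeholders rather than values from this thread, and it assumes a Filebeat version where the filestream input is available (the older log input takes paths in the same way).

```yaml
filebeat.inputs:
  - type: filestream
    id: csv-files                  # hypothetical ID; any unique string
    enabled: true
    paths:
      - /path/to/csv/*.csv         # placeholder glob; adjust to your directory
```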

To parse the CSV content you could use the decode_csv_fields processor.
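
A minimal sketch of that processor, assuming the raw line arrives in the message field. The target field name csv is only an illustration and can be anything that does not clash with existing fields.

```yaml
processors:
  - decode_csv_fields:
      fields:
        message: csv               # source field: target field ("csv" is illustrative)
      separator: ","
      ignore_missing: false
      overwrite_keys: true
      trim_leading_space: false
      fail_on_error: true
```

The decoded value is an array of strings, so individual columns still have to be picked out by position afterwards.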

Finally, creating the index can be done using the Elasticsearch output, and the index to be created can be specified using the index option.
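
A sketch of the output side, assuming a hypothetical index prefix csv-events and a placeholder host. The date part of the index name is taken from each event's @timestamp, which is why the event timestamp matters for day-wise indices. Depending on the Filebeat version, a custom index name may be ignored while ILM is enabled, so the sketch turns ILM off. Treat that as an assumption to verify against the docs for your version.

```yaml
output.elasticsearch:
  hosts: ["https://localhost:9200"]       # placeholder host
  index: "csv-events-%{+yyyy.MM.dd}"      # hypothetical prefix; date comes from @timestamp

# A custom index name needs a matching template name and pattern.
setup.template.name: "csv-events"
setup.template.pattern: "csv-events-*"

# Assumption: ILM is not wanted here, so it is disabled to let the index setting apply.
setup.ilm.enabled: false
```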

Hope that helps!

@carly.richmond Thanks for your response. Let me check, and I will get back to you if I need more help.

Thanks,
Debasis

If you want to send data to the index based on a field in the CSV file you may also need to use a timestamp processor once the line has been parsed.

@Christian_Dahlqvist and @carly.richmond.
Below is one record we received in a CSV file. The sixth field is the timestamp "1689182220000" (epoch format), and the index should be created automatically based on it.

1711807094739562106740,31,INDAT,33,11711807094739562106740,1689182220000,V2,542,7,,59,7208,2,,,44,1,919968881699,8,IND19,1,919103699945,8,INDAT,404971040417284,,,Omgg so late baby😭😭,30458,1,CFA-9423,919968881699,1,1,,MSISDN,P2P,,,IND19,919968881699,0,0,MT

Could you please help us with how to process this timestamp field?

Thanks,
Debasis
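
For the sample record above, one possible processor chain is sketched below. It is not a tested configuration: the target field name csv is only illustrative, and the sketch assumes the raw line is in the message field and that the epoch value always sits in the sixth column (zero-based index 5).

```yaml
processors:
  # 1. Split the raw line into an array of strings.
  - decode_csv_fields:
      fields:
        message: csv               # "csv" is an illustrative target field name
      separator: ","
  # 2. Copy the sixth column (zero-based index 5) into its own field.
  - extract_array:
      field: csv
      mappings:
        timestamp: 5
  # 3. Parse the epoch-millisecond value and write it to @timestamp.
  - timestamp:
      field: timestamp
      target_field: "@timestamp"
      layouts:
        - UNIX_MS
```

With @timestamp set from the record, an index name that contains %{+yyyy.MM.dd} should resolve to the record's date (2023.07.12 in UTC for 1689182220000) rather than the ingestion date.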

@carly.richmond After changing the filebeat.yml file, we are still getting the error below.

timestamp":"2024-06-27T11:25:20.408+0530","log.origin":{"file.name":"instance/beat.go","file.line":1274},"message":"Exiting: error initializing processors: the processor action date does not exist.

Below is the filebeat.yml file in my env.

[root@cb-4 filebeat]# cat filebeat.yml
###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

# ============================== Filebeat inputs ===============================

filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# filestream is an input for collecting log messages from files.

  # Unique ID among all inputs, an ID is required.

  # Change to true to enable this input configuration.

  # Paths that should be crawled and fetched. Glob based paths.

- type: log
  enabled: true
  paths:
   - /cbdata/elastic/cb4lv1/cb4transv4/*.csv # path to your CSV file
  exclude_lines: ['^\"\"']          # header line
  pipeline: parse_elastic_data_v2
  harvestor_buffer_size: 40960000
  close_eof: true
  close_inactive: 10s

filebeat.registry.flush: 60s

 

# ============================== Filebeat modules ==============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 1
  #index.codec: best_compression
  #_source.enabled: false

setup.template.name: "sfwindextemplate"
setup.template.pattern: "sfwindex-*"
setup.template.enabled: true

#setup.ilm.enabled: true
#setup.ilm.policy_name: sfwilm



# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "http://10.10.17.54:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["https://10.10.18.174:9200","https://10.10.18.215:9200"]
  worker: 8
  bulk_max_size: 3000
  # Protocol - either `http` (default) or `https`.
  protocol: "https"
  index: "sfwindex-%{+yyyy.MM.dd}-%{index_num}"


  # Authentication credentials - either API key or username/password.
  # api_key: "id:api_key"
  username: "elastic"
  password: "elastic"
  ssl:
    enabled: true
    certificate_authorities: ["/etc/filebeat/certs/cert.pem"]
# ================================= Processors =================================
processors:

 - decode_csv_fields:
      fields:
        message: 2023-07-09_17118023455.csv
      separator: ","
      trim_leading_space: true
      overwrite_keys: true

  - drop_fields:

      fields: ["agent" , "ecs" , "host" , "input" , "sequence" , "component" , "edr_version" , "prov_version" , "msisdn_ton" , "msisdn_npi"]
      ignore_missing: false

  - date:

      field: "timestamp"
      target_field: "@timestamp"
      formats:
         - "UNIX_MS"

# ================================== Logging ===================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging:
  level: debug
  to_files: true
  files:
    path: /var/log/filebeat
    name: filebeat.log
    keepfiles: 7

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]

path.logs: /var/log/filebeat

# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster.  This requires xpack monitoring to be enabled in Elasticsearch.  The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false

monitoring:
  enabled: true
  elasticsearch:
    username: beats_system
    password: beats_system

Thanks,
Debasis

In filebeat I believe the name of the processor is timestamp and not date. Please see the docs I linked to earlier.

@Christian_Dahlqvist Sorry, I missed that one. After changing the processor to "timestamp", I am still getting the same kind of error.

Thanks,
Debasis

Did you check the processor configuration syntax against the docs?

Yes, it was checked.

Thanks,
Debasis

What does the config look like now? What is the exact error you are getting?

Please find the filebeat.yml file as below.

###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

# ============================== Filebeat inputs ===============================

filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# filestream is an input for collecting log messages from files.

  # Unique ID among all inputs, an ID is required.

  # Change to true to enable this input configuration.

  # Paths that should be crawled and fetched. Glob based paths.

- type: log
  enabled: true
  paths:
   - /cbdata/elastic/cb4lv1/cb4transv4/*.csv # path to your CSV file
  exclude_lines: ['^\"\"']          # header line
  pipeline: parse_elastic_data_v2
  harvestor_buffer_size: 40960000
  close_eof: true
  close_inactive: 10s

filebeat.registry.flush: 60s

  
# ============================== Filebeat modules ==============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 1
  #index.codec: best_compression
  #_source.enabled: false

setup.template.name: "sfwindextemplate"
setup.template.pattern: "sfwindex-*"
setup.template.enabled: true

#setup.ilm.enabled: true
#setup.ilm.policy_name: sfwilm


# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

# ================================= Dashboards =================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here or by using the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "http://10.10.17.54:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

# =============================== Elastic Cloud ================================

# These settings simplify using Filebeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["https://10.10.18.174:9200","https://10.10.18.215:9200"]
  worker: 8
  bulk_max_size: 3000
  # Protocol - either `http` (default) or `https`.
  protocol: "https"
  index: "sfwindex-%{+yyyy.MM.dd}-%{index_num}"

  # Authentication credentials - either API key or username/password.
  # api_key: "id:api_key"
  username: "elastic"
  password: "elastic"
  ssl:
    enabled: true
    certificate_authorities: ["/etc/filebeat/certs/cert.pem"]
# ------------------------------ Logstash Output -------------------------------
#output.logstash:
  # The Logstash hosts
  #hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

# ================================= Processors =================================
processors:

 - decode_csv_fields:
      fields:
        message: 2023-07-09_17118023455.csv
      separator: ","
      trim_leading_space: true
      overwrite_keys: true

  - drop_fields:

      fields: ["agent" , "ecs" , "host" , "input" , "sequence" , "component" , "edr_version" , "prov_version" , "msisdn_ton" , "msisdn_npi"]
      ignore_missing: false

  - timestamp:

      field: "timestamp"
      target_field: "@timestamp"
      formats:
         - "UNIX_MS"

# ================================== Logging ===================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging:
  level: debug
  to_files: true
  files:
    path: /var/log/filebeat
    name: filebeat.log
    keepfiles: 7

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]

path.logs: /var/log/filebeat

# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster.  This requires xpack monitoring to be enabled in Elasticsearch.  The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false

monitoring:
  enabled: true
  elasticsearch:
    username: beats_system
    password: beats_system

The error is as below.

9-98bf-5d2487aef70f","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-06-27T11:25:20.408+0530","log.origin":{"file.name":"instance/beat.go","file.line":1274},"message":"Exiting: error initializing processors: the processor action date does not exist.

Thanks,
Debasis

It is still trying to use a processor named date, so it is not running the filebeat.yml you shared.

Did you restart filebeat?

Should formats be layouts in the timestamp processor?

Try this:

  - timestamp:

      field: "timestamp"
      target_field: "@timestamp"
      layouts:
         - "UNIX_MS"

@leandrojmp Now the index has been created, but it was created with the current date. However, the requirement is that Filebeat should read the timestamp field from the CSV file (6th Field) and generate the index based on the date mentioned in the "timestamp" field data. In the following example, if we convert the epoch date (1689182220000) to its actual format, it will correspond to 2023-07-12. Consequently, the index should be created as sfwindex-20230712.

1711807094739562106740,31,INDAT,33,11711807094739562106740,1689182220000,V2,542,7,,59,7208,2,,,44,1,919968881699,8,IND19,1,919103699945,8,INDAT,404971040417284,,,Omgg so late baby😭😭,30458,1,CFA-9423,919968881699,1,1,,MSISDN,P2P,,,IND19,919968881699,0,0,MT

Please advise.

Thanks,
Debasis

All timestamps in Elasticsearch are stored in UTC, so I believe that is what the timestamp processor converts to. The date in the index name may therefore depend on the timezone you are located in. You may be able to get the effect you are looking for by adjusting the timezone setting in the processor, but the timestamp could then be wrong in Elasticsearch and Kibana.

Please share the output you have in Elasticsearch.

It is not clear what output you are getting from the processors in your filebeat.yml.

@leandrojmp Please find the filebeat.yml file for your reference.


[root@cb-4 filebeat]# cat filebeat.yml
###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

# ============================== Filebeat inputs ===============================

filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# filestream is an input for collecting log messages from files.

  # Unique ID among all inputs, an ID is required.

  # Change to true to enable this input configuration.

  # Paths that should be crawled and fetched. Glob based paths.

- type: log
  enabled: true
  paths:
   - /cbdata/elastic/cb4lv1/cb4transv4/*.csv # path to your CSV file
  exclude_lines: ['^\"\"']          # header line
  pipeline: parse_elastic_data_v2
  harvestor_buffer_size: 40960000
  close_eof: true
  close_inactive: 10s

filebeat.registry.flush: 60s

  # Exclude lines. A list of regular expressions to match. It drops the lines that are
  # matching any regular expression from the list.
  # Line filtering happens after the parsers pipeline. If you would like to filter lines
  # before parsers, use include_message parser.
  #exclude_lines: ['^DBG']

  # Include lines. A list of regular expressions to match. It exports the lines that are
  # matching any regular expression from the list.
  # Line filtering happens after the parsers pipeline. If you would like to filter lines
  # before parsers, use include_message parser.
  #include_lines: ['^ERR', '^WARN']

  # Exclude files. A list of regular expressions to match. Filebeat drops the files that
  # are matching any regular expression from the list. By default, no files are dropped.
  #prospector.scanner.exclude_files: ['.gz$']

  # Optional additional fields. These fields can be freely picked
  # to add additional information to the crawled log files for filtering
  #fields:
  #  level: debug
  #  review: 1

# ============================== Filebeat modules ==============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 1
  #index.codec: best_compression
  #_source.enabled: false

setup.template.name: "sfwindextemplate"
setup.template.pattern: "sfwindex-*"
setup.template.enabled: true

setup.ilm.enabled: true
setup.ilm.policy_name: sfwilm


# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

# ================================= Dashboards =================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here or by using the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "http://10.10.17.54:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

# =============================== Elastic Cloud ================================

# These settings simplify using Filebeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["https://10.10.18.174:9200","https://10.10.18.215:9200"]
  worker: 8
  bulk_max_size: 3000
  # Protocol - either `http` (default) or `https`.
  protocol: "https"
  index: "sfwindex-%{+yyyy.MM.dd}"

  # Authentication credentials - either API key or username/password.
  # api_key: "id:api_key"
  username: "elastic"
  password: "elastic"
  ssl:
    enabled: true
    certificate_authorities: ["/etc/filebeat/certs/cert.pem"]
# ------------------------------ Logstash Output -------------------------------
#output.logstash:
  # The Logstash hosts
  #hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

# ================================= Processors =================================
processors:

  - timestamp:

      field: "timestamp"
      target_field: "@timestamp"
      layouts:
         - "UNIX_MS"

# ================================== Logging ===================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging:
  level: debug
  to_files: true
  files:
    path: /var/log/filebeat
    name: filebeat
    keepfiles: 7

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]

#path.logs: /var/log/filebeat

# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster.  This requires xpack monitoring to be enabled in Elasticsearch.  The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false

monitoring:
  enabled: true
  elasticsearch:
    username: beats_system
    password: beats_system

# Sets the UUID of the Elasticsearch cluster under which monitoring data for this
# Filebeat instance will appear in the Stack Monitoring UI. If output.elasticsearch
# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch.
#monitoring.cluster_uuid:

# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well.
# Note that the settings should point to your Elasticsearch *monitoring* cluster.
# Any setting that is not set is automatically inherited from the Elasticsearch
# output configuration, so if you have the Elasticsearch output configured such
# that it is pointing to your Elasticsearch monitoring cluster, you can simply
# uncomment the following line.
#monitoring.elasticsearch:

# ============================== Instrumentation ===============================

# Instrumentation support for the filebeat.
#instrumentation:
    # Set to true to enable instrumentation of filebeat.
    #enabled: false

    # Environment in which filebeat is running on (eg: staging, production, etc.)
    #environment: ""

    # APM Server hosts to report instrumentation results to.
    #hosts:
    #  - http://localhost:8200

    # API Key for the APM Server(s).
    # If api_key is set then secret_token will be ignored.
    #api_key:

    # Secret token for the APM Server(s).
    #secret_token:


# ================================= Migration ==================================

# This allows to enable 6.7 migration aliases
#migration.6_to_7.enabled: true

[root@cb-4 filebeat]#

Thanks,
Debasis

You already shared this. You need to share what the event looks like in Elasticsearch.

Edit: where is the decode_csv_fields processor in this file? Are you not parsing the CSV?
