Need help setting up Filebeat

Hello, I want to use Filebeat to import my logs into Elasticsearch.
I have a Python script which collects data from 30 sensors and stores the data from each sensor in its own directory and log file (A-B/0-14/YYYY-MM-dd.log) every minute.
Here is an example of a line in the log: 2018-02-15 10:05:37 - Temp:24.5 Humidity:48.5
How do I send the data from each folder to its own index (a0, a1, a2, b0, b14, etc.)?
Note that the local files are on a Windows machine and I run Filebeat on Windows.
Thanks a lot in advance.

I need help configuring the ingest pipeline and the grok pattern.

As you are in control of the script, I'd recommend outputting JSON. The Filebeat input can parse JSON, allowing you to dereference and/or filter by fields. Then you won't need grok or an ingest pipeline :slight_smile:
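Something along these lines might work as a starting point (just an untested sketch; the path is a placeholder you'd adjust to your layout):

filebeat.prospectors:
- type: log
  paths:
    - "C:/sensordata/*/*/*.log"
  # each log line must be exactly one JSON object
  json.keys_under_root: true
  json.add_error_key: true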


Okay, I've made the script output the reading from each sensor to its own JSON file. The output looks like this:

"reading": {
"timestamp": "2018-02-12 01:14:13",
"temperature": 24.4,
"humidity": 45.5,
}

How do I configure Filebeat to send the JSON files from the separate folders to separate indices?
Any help would be highly appreciated. Thanks in advance :slight_smile:

Can you share some more detail about your setup?

To me it reads like you have this kind of layout:

${data_root}/A/0/*.log
${data_root}/A/1/*.log
...
${data_root}/B/0/*.log
${data_root}/B/1/*.log
...

I guess your filebeat prospector configuration looks like this:

filebeat.prospectors:
- type: log
  paths:
    - "${data_root}/*/*/*.log"

Filebeat puts the full file path into the source field. Unfortunately you can't do any custom string processing in Filebeat itself.

For the Elasticsearch output we have the index and the indices settings.

Using the index setting, one normally sets some common field in the prospector config like this:

filebeat.prospectors:
- type: log
  paths:
    - "${data_root}/A/0/*.log"
  fields.category: "a0"
  fields_under_root: true

- type: log
  paths:
    - "${data_root}/A/1/*.log"
  fields.category: "a1"
  fields_under_root: true

...

output.elasticsearch:
  index: '%{[category]}-%{+yyyy.MM.dd}'
  ...

Using string formatting, you might use two fields: one for the type and the other for the id/zone.

But no matter whether you use index or indices, you would have quite some repetition in your config file.

Using Logstash or an ingest node, you could try to grok/parse the path. In an ingest node the index name can be rewritten by overwriting the _index field.
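For example, an ingest pipeline along these lines could parse the path from the source field and rewrite _index (a rough, untested sketch; it assumes forward slashes in the path, so the grok pattern would need adjusting for Windows backslashes, and the pipeline name is just an example):

PUT _ingest/pipeline/route-by-path
{
  "description": "sketch: derive the index name from the log file path",
  "processors": [
    {
      "grok": {
        "field": "source",
        "patterns": ["%{GREEDYDATA}/(?<board>[ABab])/(?<identifier>[0-9]+)/%{GREEDYDATA}\\.log"]
      }
    },
    {
      "lowercase": {
        "field": "board"
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "{{board}}{{identifier}}"
      }
    }
  ]
}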

Alternatively you could have your script set the category field(s) when writing the event. That way each line/event is completely self-contained, and no additional processing is required, because no part of the event metadata is encoded out-of-band in the path anymore.
The advantage is that the script already knows the category/type from the path, and this removes some of the repetition in the config file, or in the setup in general (at the cost of a few more bytes being written to the file). If you go for this solution, I wonder if you really need multiple indices at all, or if being able to filter/query by the category field in Kibana would do the trick. With only one index, the index template can be managed by Filebeat (with multiple indices, you have to manage the mappings/templates for each index yourself). A minimal single-index setup is sketched below.
With identifying metadata in the event, you might also reconsider the number of log files/directories you actually need.
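A single-index setup along those lines might shrink to something like this (sketch only; 'sensors' is just an example index name, and the json settings assume one JSON object per line):

filebeat.prospectors:
- type: log
  paths:
    - "${data_root}/*/*/*.log"
  json.keys_under_root: true
  json.add_error_key: true

output.elasticsearch:
  index: 'sensors-%{+yyyy.MM.dd}'
  ...

In Kibana you would then filter or aggregate on the category field instead of switching between indices.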


Hey @steffens, thanks for the reply and the help, mate.
I do have this kind of layout as you mentioned:

${data_root}/A/0/*.log
${data_root}/A/1/*.log
...
${data_root}/B/0/*.log
${data_root}/B/1/*.log

Regarding my setup, I run the latest ELK image from Bitnami on AWS, and on the local machine I have Windows 10 with the Python script that reads from the sensors and Filebeat installed.

Every log file gets a new reading added every minute, and every day a new log file is created.
I have set up the prospectors as you said, but I have no clue how to make this work with Logstash or an ingest node, since this is all kind of new to me.
I can output all the readings to one file/directory and add identifying metadata to the event. With that said, I have no problem setting up all the mappings/templates for each index as long as I have the know-how, which sadly I don't have yet :slight_smile:

Here is the script that saves the logs, for reference: https://gist.github.com/5832b448cbd384261389a900f5fe3568.git

Thanks a lot for the help, can't thank you enough

The script almost looks like a use case for Metricbeat. But then I guess you are reading from some 'special' devices/boards with a customized, non-standardized content encoding.

Given that all the data have the very same schema and you don't have that much data, I would personally put everything into one single index in Elasticsearch. That also reduces the total number of shards you will end up with.

Any reasons for wanting multiple indices?

I'd also split the collection and the reporting/writing into separate threads, with one thread collecting and writing all data from the worker loops into one daily log file. By querying once per minute, you will have only 30 * 24 * 60 = 43,200 events per day (given you have a total of 30 devices).

Fields I would report: board, identifier, @timestamp, temperature, humidity.
By having board and identifier in the event, you can configure a per-board index by setting output.elasticsearch.index: '%{[board]}%{[identifier]}-%{+yyyy.MM.dd}'.

In Beats/Logstash the default event timestamp is @timestamp. If Filebeat finds @timestamp in the JSON event, it tries to parse it (the parsing is pretty strict). If your document uses another name for the timestamp, Filebeat will add @timestamp itself, which is basically the read timestamp (nice to have for checking latencies between event collection and Filebeat picking up the events).
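To make that concrete, a single self-contained event line could look something like this (just an illustration; the field names are up to you, and the @timestamp value is written in an ISO 8601 style that should parse directly):

{"board": "a", "identifier": "0", "@timestamp": "2018-03-05T02:46:48.000Z", "temperature": 25.5, "humidity": 27.1}

With the index setting above, this event would end up in an index named a0-2018.03.05.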

Beats 6.x introduces the fields.yml file for configuring the event schema. This file is used by Beats to:

  • create the Elasticsearch index template (compatible with the Elasticsearch version in use)
  • create the Kibana index pattern (compatible with the Kibana version in use)

If fields are missing from the template, or you don't have a template at all, Elasticsearch tries to infer the index mapping. This works out OK in most cases. But if Elasticsearch detects the wrong type at index creation, indexing some other events might fail later on. That is, to some degree setting up the template is optional :slight_smile:
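For reference, a minimal fields.yml entry for readings like these might look roughly as follows (a sketch based on the 6.x layout; the key, names, and descriptions are only examples):

- key: sensors
  title: "Sensor readings"
  description: >
    Temperature and humidity readings collected from the boards.
  fields:
    - name: board
      type: keyword
    - name: identifier
      type: keyword
    - name: temperature
      type: float
    - name: humidity
      type: float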

The settings for configuring the index template are documented here. If you have more than one index, you should disable automatic template setup by setting setup.template.enabled: false. The getting started guide shows some different strategies for managing templates. Using the manual approach, it is up to you whether you want to make use of fields.yml or modify the generated JSON file.
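For a single custom index, the relevant bits in filebeat.yml would be something along these lines (sketch only; 'sensors' is just an example name, and the fields path assumes the default fields.yml location):

setup.template.name: "sensors"
setup.template.pattern: "sensors-*"
setup.template.fields: "fields.yml"
setup.template.overwrite: false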

Hey @steffens, thanks for the help again.
Well... I had some trouble getting this configured and working,
so I made some changes after a lot of failures.
Currently I have modified the script to output plain text from all the sensors/boards into one file, adding another field that identifies the board. The output looks like this:

2018-03-05 02:46:48 25.50 27.10 A0
2018-03-05 02:46:48 25.80 52.00 A1
2018-03-05 02:46:48 26.10 51.30 A2
2018-03-05 02:46:52 25.50 27.10 A0
2018-03-05 02:46:52 25.80 52.00 A1

After I was finally able to import the data into Elasticsearch,
I opted to use an ingest node pipeline in order to separate it into different indices. The only problem is that I don't know how to do it :sweat_smile: This is what I have so far:

PUT _ingest/pipeline/test-pipeline
{
  "description" : "A test pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": (?<timestamp>%{YEAR}-%{MONTHNUM:month}-%{MONTHDAY:day} %{TIME}) %{NUMBER:temperature:float} %{NUMBER:humidity:float} %{GREEDYDATA:sid}
      }
    }
  ]
}

I'm 90% sure that %{GREEDYDATA:sid} isn't the right pattern for this data, but I couldn't figure out how grok works properly :sweat:

I want to save the data to an index named after sid (a0, a1, a2).
I'm pretty close to making it work... I think ><"
Please help, thanks in advance.
Edit: I was able to run it through the ingest node, but it doesn't convert the timestamp into a date format, so I can't use that :frowning: How do I make it work with my timestamp value?
Here is a JSON view of an ingested event:

{
  "_index": "test",
  "_type": "doc",
  "_id": "S7bX82EBGPjS5sDswIAK",
  "_version": 1,
  "_score": 2,
  "_source": {
    "offset": 70,
    "prospector": {
      "type": "log"
    },
    "source": "/var/log/kaki/3.log",
    "message": "2018-03-05 03:42:10 25.50 27.80 A0",
    "tags": [
      "json"
    ],
    "sid": "A0",
    "@timestamp": "2018-03-05T01:47:37.312Z",
    "month": "03",
    "beat": {
      "hostname": "ip-1",
      "name": "ip-172",
      "version": "6.2.2"
    },
    "temperature": 25.5,
    "humidity": 27.8,
    "fields": {
      "category": "a0"
    },
    "day": "05",
    "timestamp": "2018-03-05 03:42:10"
  },
  "fields": {
    "@timestamp": [
      "2018-03-05T01:47:37.312Z"
    ]
  }
}

Do not parse the timestamp in your grok pattern. Check out the Filebeat modules for sample ingest node pipelines, e.g. have a look at the nginx pipeline.

The @timestamp field must be a string. Elasticsearch will parse it as a timestamp for you.
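If you do want the original reading time to become the event @timestamp, one option (similar to what the module pipelines do) is to add a date processor after your grok processor, roughly like this (untested sketch; adjust the timezone to wherever the readings are taken):

{
  "date": {
    "field": "timestamp",
    "target_field": "@timestamp",
    "formats": ["yyyy-MM-dd HH:mm:ss"],
    "timezone": "UTC"
  }
}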

How did you end up with @timestamp being an array in fields?
