Miscellaneous questions on the back of Ricardo's "All You Want to Know About Filebeat"

As I was working through Ricardo's YouTube video I ended up with the questions below. I posted them in the video's comment section, but thought they might get more traction reposted here.

I'm still a total noob wrt EK, and am putting together the bits of what's possible and what's not, what's a good idea and what isn't.

#1
A technical question: do the prospectors look for new files based on the name or the inode? With file rotation, today's file is compressed and renamed tonight, and a new file is then created with the same name, which implies the registry entry needs to be reset to line 0.

#2
Hehe, apologies for all the questions. I noticed you're also on a Mac, and that you're not doing a sudo on each command (re Filebeat). Did you change the ownership to allow Filebeat to operate, or did you do a sudo su - when we weren't looking? Because when you created new files you also never modified permissions.

#3
Adding to the structured event from #1: you want to extract the main start and end as one event, but what if the main "loop" includes sub-loops that you want to show themselves? Think of a large batch with a start and an end, but inside the large batch you have multiple looping processes that you want to show as they cycle (and not wait for the main batch start/end to complete).

... For structured events, if the start and end include an event ID, can they be associated with each other? In the current form of your example it applies to a batch process starting and ending, not to many transactions that can end up being interleaved.

#4
A question: when shipping via Kafka, how can you execute the Kibana configuration? Thinking you might have a setup where the sources (*Beats) only have access to the Kafka brokers, and not to the Elasticsearch or Kibana servers / network layers / subnets / firewalls, etc.

#5
... With one Filebeat process running, I see we can specify the topic based on a "when" clause, and I noticed you can also include a Kafka message key (helping make sure all messages for a key (maybe one key per file) stay in order on a topic, localised to a partition). Question: in a scenario where I don't want to use a Kafka key, can I split the output to different topics (or even indices) based on the originating input file?

#6
... Hoping there are similar Auditbeat, Packetbeat, Metricbeat, and Winlogbeat videos. If yes, please update the video text with links to them.

#7
In the heroes 04 example you pulled the config into a separate filebeats.yml file. Does this imply you will run two processes, or can you pull this into the main file, with this file still going to its own index/pipeline, and the others to the /var/log/*.log index?
Just thinking: you might have multiple files in the same directory and want each to go into its own index, some single-line, some multiline, some structured, etc. Expanding on this, I might want a single filebeat.yml process running, but push each source log onto its own Kafka topic, to then be pushed via a Kafka connector to its own index.

G


Hi @George_Leonard,

First of all, my apologies for taking some time to respond to your questions. I have been busy. I really appreciate you taking the time to organize your thoughts very clearly here. It takes a village. Now, without further ado...

#1
[quote]
A technical question: do the prospectors look for new files based on the name or the inode? With file rotation, today's file is compressed and renamed tonight, and a new file is then created with the same name, which implies the registry entry needs to be reset to line 0.
[/quote]

I'm not so sure about the nature of the problem you are describing here, or whether there is one at all. But if you want fine control over how the prospector discovers and handles new files, and over when it retires entries in the registry file, you can take a look at the clean_* options available in Filebeat.
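For reference, a minimal sketch of what those clean-up settings look like in filebeat.yml (the path and durations are placeholders, not recommendations):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log   # placeholder path
    # Stop harvesting files that haven't been updated for 48h
    ignore_older: 48h
    # Remove the registry entry for files untouched for 72h
    # (clean_inactive must be greater than ignore_older + scan_frequency)
    clean_inactive: 72h
    # Drop registry entries when the file is deleted from disk
    clean_removed: true
```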

#2
[quote]
Hehe, apologies for all the questions. I noticed you're also on a Mac, and that you're not doing a sudo on each command (re Filebeat). Did you change the ownership to allow Filebeat to operate, or did you do a sudo su - when we weren't looking? Because when you created new files you also never modified permissions.
[/quote]

Good catch :wink:

In fact, by the time I recorded that video I was using Fedora Linux, and on my machine I had this ownership thing sorted out by some group permissions. However, the approach to solving this is fairly simple: you just need to give the right administrative permissions to whoever is going to run Filebeat. Alternatively, you can disable these permission checks (the --strict.perms=false command-line flag) as mentioned here.

#3
[quote]
Adding to the structured event from #1: you want to extract the main start and end as one event, but what if the main "loop" includes sub-loops that you want to show themselves? Think of a large batch with a start and an end, but inside the large batch you have multiple looping processes that you want to show as they cycle (and not wait for the main batch start/end to complete).

... For structured events, if the start and end include an event ID, can they be associated with each other? In the current form of your example it applies to a batch process starting and ending, not to many transactions that can end up being interleaved.
[/quote]

This use case of yours is possible with the multiline configuration, but it will require extra fine-tuning. Specifically, you would need to leverage the sub-options negate, match, flush_pattern, max_lines, and count_lines. It all boils down to what you consider an event for your use case. You can find examples of this in the Elastic documentation: https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html
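As an illustration, here is a hedged sketch of such a multiline setup, assuming hypothetical BATCH-START / BATCH-END markers in the log (your real start/end patterns will differ):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/batch/*.log   # placeholder path
    # Any line that does NOT match the start marker is appended
    # to the event opened by the last BATCH-START line
    multiline.pattern: '^BATCH-START'
    multiline.negate: true
    multiline.match: after
    # Flush (close) the event as soon as the end marker is seen
    multiline.flush_pattern: '^BATCH-END'
    # Safety valve so a missing end marker can't grow events forever
    multiline.max_lines: 5000
```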

#4
[quote]
A question: when shipping via Kafka, how can you execute the Kibana configuration? Thinking you might have a setup where the sources (*Beats) only have access to the Kafka brokers, and not to the Elasticsearch or Kibana servers / network layers / subnets / firewalls, etc.
[/quote]

This is why Beats separate the configuration of the output from Kibana. You can still configure your Kibana server in the configuration file while the output points to something other than Elasticsearch. That way, you can run the setup command and still be able to load the indices, templates, and dashboards.
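For illustration, a sketch of a filebeat.yml where Kibana is configured for the setup command while events ship to Kafka (all hostnames are placeholders):

```yaml
# Only used by the `filebeat setup` command, which can be run
# from a host that CAN reach Kibana
setup.kibana:
  host: "kibana.internal:5601"   # placeholder host

# The shipping hosts only need to reach the Kafka brokers
output.kafka:
  hosts: ["broker1:9092", "broker2:9092"]   # placeholder brokers
  topic: "filebeat-logs"
```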

#5
[quote]
... With one Filebeat process running, I see we can specify the topic based on a "when" clause, and I noticed you can also include a Kafka message key (helping make sure all messages for a key (maybe one key per file) stay in order on a topic, localised to a partition). Question: in a scenario where I don't want to use a Kafka key, can I split the output to different topics (or even indices) based on the originating input file?
[/quote]

In the case of using Kafka as the output, you have some options here. As you may know, Kafka supports the concept of partitioning on the producer side: you can either specify a key, in which case the hash of that key is used to route the message to a partition, or you can explicitly choose a different partitioning strategy with the random or round_robin options. Out of these three, the only one that gives you control over which partition is chosen is the hash by key.
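For the topic-splitting part of the question, the Kafka output also accepts a topics array with per-topic when conditions; a sketch, with placeholder broker, topic, and path values:

```yaml
output.kafka:
  hosts: ["broker1:9092"]    # placeholder broker
  # Default topic used when no condition below matches
  topic: "logs-default"
  topics:
    # Route events by the originating file, no message key required
    - topic: "logs-myapp"
      when.contains:
        log.file.path: "myapp"
    - topic: "logs-syslog"
      when.contains:
        log.file.path: "syslog"
```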

The other way to solve this problem is using Logstash in between Filebeat and Kafka. You can use Logstash as your output and then create a more sophisticated routing rule based on data content. This will be a trade-off between flexibility and simplicity — as you will end up with another distributed system to manage.

#6
[quote]
... Hoping there are similar Auditbeat, Packetbeat, Metricbeat, and Winlogbeat videos. If yes, please update the video text with links to them.
[/quote]

There are two more series, for Metricbeat and Heartbeat respectively. You may want to subscribe to the Elastic Community Channel to be notified about these series :slightly_smiling_face:

#7
[quote]
In the heroes 04 example you pulled the config into a separate filebeats.yml file. Does this imply you will run two processes, or can you pull this into the main file, with this file still going to its own index/pipeline, and the others to the /var/log/*.log index?
Just thinking: you might have multiple files in the same directory and want each to go into its own index, some single-line, some multiline, some structured, etc. Expanding on this, I might want a single filebeat.yml process running, but push each source log onto its own Kafka topic, to then be pushed via a Kafka connector to its own index.
[/quote]

You can run just one Filebeat process with multiple prospectors defined. An organized way to accomplish this is to keep your custom configuration files in the modules.d folder without the .disabled suffix; that way Filebeat reads them automatically. I don't see many advantages in having multiple Filebeat processes, one per configuration file, other than being able to fine-control the durability of each process, since each one will have its own data folder. Filebeat supports different layouts; which one to pick is up to you.
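As a sketch of the single-process layout, each input can carry its own parsing settings and its own index (paths and index names are made up; the per-input index option applies when the Elasticsearch output is used):

```yaml
filebeat.inputs:
  # Plain single-line logs go to their own index
  - type: log
    paths: ["/var/log/app-a/*.log"]   # placeholder path
    index: "app-a-%{+yyyy.MM.dd}"
  # Multiline logs (e.g. indented stack traces) go to another index
  - type: log
    paths: ["/var/log/app-b/*.log"]   # placeholder path
    # Lines starting with whitespace are joined to the previous line
    multiline.pattern: '^\s'
    multiline.match: after
    index: "app-b-%{+yyyy.MM.dd}"
```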

Cheers,

@riferrei

No problem. Thanks for taking the time to create an awesome video.

Let me try to clear my question up. The prospector scans a directory for files, gets a file name, and then starts harvesting that file. So for the file myapp.log it knows it has harvested up to line 500. Midnight comes, and the directory has been tagged for file rotation, i.e. the file is renamed and a new, empty myapp.log is touched. The next time the harvester tries to read the file, the filename is still there, but it's a new file that starts out empty. So, as there is maybe no line 500, does the harvester realize this must be a new file and go back to line 0?
Re the old file: I'm assuming we accept that if we have configured the harvester to run every 5 minutes, and the last time it ran was 23:46 and the file got renamed at 0:00, we might lose those 4 minutes' worth of data.

Imagine a scenario where the source machine cannot get to port 5601 under any circumstances; the source can only get to the broker publish port 9201.
Is there a manual way to tag a newly created index as a "Metricbeat" index and then, via a user process, create the associated dashboard configurations which would normally have been done by the Filebeat process?

A simple way to explain this: I want to use the when clause to define the topic to push the log to, based on the input filename. I don't want to do this with an SMT, as that implies messages are read from one topic and then republished onto a second topic based on the SMT definition.

[quote]
There are two more series, for Metricbeat and Heartbeat respectively. You may want to subscribe to the Elastic Community Channel to be notified about these series :slightly_smiling_face:
[/quote]
ALREADY subscribed, will go look for them...
Can I ask for an Auditbeat and a Packetbeat one too, pleaseeeee?

Just wondering/exploring how to sometimes consume a large elephant...

Thanks.

G

[quote]
Let me try to clear my question up. The prospector scans a directory for files, gets a file name, and then starts harvesting that file. So for the file myapp.log it knows it has harvested up to line 500. Midnight comes, and the directory has been tagged for file rotation, i.e. the file is renamed and a new, empty myapp.log is touched. The next time the harvester tries to read the file, the filename is still there, but it's a new file that starts out empty. So, as there is maybe no line 500, does the harvester realize this must be a new file and go back to line 0?
Re the old file: I'm assuming we accept that if we have configured the harvester to run every 5 minutes, and the last time it ran was 23:46 and the file got renamed at 0:00, we might lose those 4 minutes' worth of data.
[/quote]

Your understanding of how Filebeat manages entries in the registry file is not incorrect. The only point to remember is that this won't be a problem for you. In your case, once the file app-1.log has been fully processed, finishing on line 500, the file app-2.log is created and, from your app's perspective, line 501 goes to the new file. From Filebeat's perspective, it simply reads whatever lines come next in the sequence of processing for that input; regardless of it being another file, it continues with event 501, now stored on line 0 of the new file. Filebeat ensures at-least-once semantics for writes, so regardless of the file rotation, no events will be lost. It is the same principle as Kafka's durable logs, where each event has a pointer/offset that marks the position within the buffer to be processed. What goes to the registry file is exactly that pointer/offset.

[quote]
Imagine a scenario where the source machine cannot get to port 5601 under any circumstances; the source can only get to the broker publish port 9201.
Is there a manual way to tag a newly created index as a "Metricbeat" index and then, via a user process, create the associated dashboard configurations which would normally have been done by the Filebeat process?
[/quote]

Got it. You can always go straight to Kibana and load any objects/dashboards you want from there. FYI, all the Kibana objects that a Beat might create can be found in its kibana folder. As for the indices themselves, you can create them administratively using the Kibana console by going to the Index Management section, or just issue an HTTP request to create them.
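For example, the dashboards can be loaded from any machine that does have access, using the setup command with inline overrides (the hostnames below are hypothetical):

```shell
# Run from a host that CAN reach Kibana and Elasticsearch;
# -E overrides the config file for this one invocation
filebeat setup --dashboards \
  -E 'setup.kibana.host=kibana.internal:5601' \
  -E 'output.elasticsearch.hosts=["es.internal:9200"]'

# Or create an index administratively with a plain HTTP request
curl -X PUT "http://es.internal:9200/filebeat-manual-000001"
```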

[quote]
A simple way to explain this: I want to use the when clause to define the topic to push the log to, based on the input filename. I don't want to do this with an SMT, as that implies messages are read from one topic and then republished onto a second topic based on the SMT definition.
[/quote]

I see, you want to get away from Kafka Connect and its annoying SMTs :upside_down_face:. I can't disagree, because I don't like them either. Try using the when clause with the Beats and let me know if that works for you. Maybe this could even become a blog post or something.

Cheers,

@riferrei


Hi Ricardo.

I think we're still disconnected here...
A lot of apps do file rotation, and so do the OS logs themselves. Take syslog: the current log into which entries are written is always called syslog.log, and the previous day's is syslog1.log.
So if I was writing to syslog today and got to line 500 in your example, that file gets renamed to syslog1.log tonight and a new syslog gets touched... and when I now write, it becomes line 0, but as per your example, entry 501.
I get the idea... it somehow records not just the line number it has read and pushed to Elastic; it also ties that line number to an inode definition for the file, so that it knows entries 1-500 are now in syslog1.log, and that although the name is still syslog, it is a new file and it must start from line 0. But that might be as simple as it noticing there is no line 500 or 400 any more, so it might just assume it's a new file and update the filename entry in the data folder with the new line number up to where it has processed.

Hehe, I actually think Kafka and Connect are extremely powerful, but they're a waste if only used for one data stream. Their power shows when you start consolidating: push your logs over them along with other data streams, and then have Elastic be one downstream target and, say, the DWH a second.

Just for this, you might want to partition that topic based on the source file name, which implies you need to inject the file name at some point as the key. Doing that in an SMT once the data is already on a topic does not make sense, as you then basically create a second topic with the data properly partitioned. I'd say that should be something you can configure in the filebeat.yml config: take advantage of it being able to inject a key value, but now being able to define what you want that key to be, i.e. the source file name, or maybe even the source hostname + file name...
Remember the downstream consumer is not necessarily only going to be Elasticsearch.

And yeah... I think I need to read loads more and see. This will definitely be an interesting blog... hint hint.
G

Yes, it's inode-based, so file rotations are covered: no logs should be lost or duplicated because of it. Or at least in most cases; your scenario should be fine, but there are two potential issues with log rotation, documented in the Filebeat troubleshooting docs: files being rotated (and removed) faster than Filebeat can finish reading them, and inode reuse making Filebeat resume at the old file's offset.

You can also check out the current state, and how inode + device form the unique ID, in Filebeat's registry folder; this example is from my Mac, I think with Filebeat 7.13:

[
  {
    "_key": "filebeat::logs::native::12926464645-83",
    "timestamp": [
      281471217189239,
      1594021549
    ],
    "FileStateOS": {
      "inode": 12926464645,
      "device": 83
    },
    "identifier_name": "native",
    "id": "native::12926464645-83",
    "source": "/mnt/logs/java-logging.log",
    "ttl": -2,
    "type": "log",
    "prev_id": "",
    "offset": 9378
  }
]

For the Kafka topics you shouldn't even need conditionals (though they are supported), since the topic name supports field references. Something like topic: "logs-%{[log.source.address]}-%{[beat.hostname]}-%{[agent.version]}" (totally untested, but the log fields and Beat fields should be your starting points).
Whether that's a great naming pattern (or will explode your number of indices / shards) is a bit of a different story though :sweat_smile:


Yeah... creating the topic names (and indices) based on file name just opens a couple of different cans of worms, one being the explosion of topics. A much cleaner solution is to have a single pre-configured topic and then key the data based on, say, the file name.
So... now I have to see if that %hostname can somehow be injected as the key.

G

+1 on this, George. Using this approach you can properly size the number of partitions the topic needs, which might be set to the number of hosts you have. Creating topics dynamically has the downside of not being able to dynamically set the number of partitions. And all of this directly dictates how fast downstream consumers (such as Kafka Connect) will read records off Kafka.
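A sketch of what that keying could look like in the Kafka output (the broker, topic, and exact field names depend on your environment and Beats version, so treat them as assumptions):

```yaml
output.kafka:
  hosts: ["broker1:9092"]   # placeholder broker
  topic: "filebeat-logs"    # single pre-created topic
  # Key each record by hostname + source file so all lines from one
  # file on one host land on the same partition, preserving order
  key: '%{[host.name]}-%{[log.file.path]}'
  partition.hash:
    # Also hash to partitions whose leader is currently unreachable,
    # so the key-to-partition mapping stays stable
    reachable_only: false
```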

Cheers,

@riferrei
