Pull Beat Data Across a Firewall

Hello All,

Just looking for a starting point on handling a DMZ firewall. I saw the guide on using Logstash, but it looks like it resides in the DMZ and pushes data to the cluster in our protected network. This won't work for us, as policy does not allow any system in the DMZ to initiate a connection to the protected network.

What is the most reliable way to handle this scenario? And are there any how-tos I can review? I imagine the solution would be that Beats agents still push to a Logstash server residing in the DMZ, and then a Logstash server on the protected network initiates the connection and "pulls" the data collected on the DMZ Logstash server?

Just looking for some experienced analysts to point me towards the most reliable design.

Thanks,

Ray

Logstash does not store data, so you cannot have one Logstash instance pulling data from another.

What you need is a message broker like Kafka to store the data received by one Logstash, and then configure the other Logstash to pull the data from this Kafka cluster.

Something like this:

Agents --> ( Logstash -> Kafka Cluster ) <-- Logstash --> Elasticsearch.

Your agents would send your data to a Logstash instance in the DMZ, which would then output it to a Kafka cluster also in the DMZ. Another Logstash instance outside the DMZ would then use the Kafka input to consume those messages and send them to Elasticsearch.
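As a rough sketch of both sides (broker addresses, topic, and index names are placeholders, not anything from your environment), the DMZ pipeline could look something like this:

```
# DMZ Logstash: receive from Beats agents, hand everything to Kafka (still inside the DMZ)
input {
  beats {
    port => 5044
  }
}

output {
  kafka {
    bootstrap_servers => "kafka-dmz-1:9092,kafka-dmz-2:9092"  # placeholder brokers in the DMZ
    topic_id          => "beats-logs"                         # placeholder topic
    codec             => json                                 # preserve the full event structure
  }
}
```

And the protected-network side, which is the one that initiates the connection through the firewall:

```
# Protected-network Logstash: connects OUT to the DMZ brokers and pulls
input {
  kafka {
    bootstrap_servers => "kafka-dmz-1:9092,kafka-dmz-2:9092"
    topics            => ["beats-logs"]
    codec             => json
  }
}

output {
  elasticsearch {
    hosts => ["https://es.internal:9200"]   # placeholder internal cluster address
    index => "beats-logs-%{+YYYY.MM.dd}"
  }
}
```

With this layout the firewall only needs to allow the protected network to reach the DMZ brokers on 9092; nothing in the DMZ ever dials into the protected network.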

Thank you for the help. Just to double check, can I not have Beats agents output directly to Kafka? Also, are there any good walkthroughs you would recommend for installing and configuring Kafka?

Yes, you can send directly from Beats to Kafka, but sending through Logstash first will give you more flexibility if you need to route the data to different topics. It is your choice.
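If you do decide to go direct, a minimal Filebeat sketch would be along these lines (brokers and topic are placeholders again):

```yaml
# filebeat.yml -- ship straight to Kafka, skipping the DMZ Logstash
output.kafka:
  hosts: ["kafka-dmz-1:9092", "kafka-dmz-2:9092"]  # placeholder brokers
  topic: "beats-logs"                              # placeholder topic
  required_acks: 1       # wait for the partition leader to ack each batch
  compression: gzip
```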

I do not have any walkthrough for Kafka, but it is pretty easy to find one.

Thank you again! Last question: do I need ZooKeeper? If not, I am thinking I still need a Kafka cluster for failover, and I can just configure the yml to list all the nodes/brokers in the cluster. Then on the protected network, pull with one Logstash and keep a second as a backup. And again, the Logstash pulls do not need ZooKeeper? Something like this?

ZooKeeper is used by Kafka to synchronize data between the brokers in the cluster; it has no relation to Logstash.

Currently, on newer versions of Kafka, you can run without ZooKeeper; you need to configure Kafka to use the internal Raft protocol (KRaft) to synchronize the brokers.

This is unrelated to Logstash: if your Kafka cluster is running, Logstash will be able to consume from it. How you implement your Kafka cluster is out of the scope of this forum, but there are plenty of tutorials on how to spin up a Kafka cluster with or without ZooKeeper; look for something like "Running a Kafka cluster with KRaft".
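Just to give a rough idea of what the KRaft route involves, a minimal single-node sketch could look like the following (node ID, hostname, and paths are placeholders; a real cluster with failover needs at least three controller-eligible nodes):

```properties
# server.properties -- one node acting as both broker and controller (KRaft, no ZooKeeper)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka-dmz-1:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://kafka-dmz-1:9092
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
log.dirs=/var/lib/kafka/data
```

The log directory has to be formatted once before first start, e.g. `bin/kafka-storage.sh format -t "$(bin/kafka-storage.sh random-uuid)" -c server.properties`.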

Thank you for the info, a huge help!

Hi, I have the exact same challenge: data in a DMZ network area, and we can't (by policy) initiate connections from the DMZ to the network segment where Elasticsearch resides (let's call it ESNS). So I am also looking for a solution where the connection is initiated from ESNS into the DMZ.

I was considering the websocket output in LS within the DMZ, combined with a websocket input in LS on ESNS connecting to that output. Any comments on whether this solution might work? I'd have some concerns re. reliability / data loss.
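For reference, the pairing I have in mind is roughly this; both are community plugins (installed via bin/logstash-plugin install), and the host/port/URL here are placeholders:

```
# DMZ Logstash: the websocket output acts as a server and waits for clients to connect in
output {
  websocket {
    host => "0.0.0.0"
    port => 3232
  }
}
```

```
# ESNS Logstash: the websocket input runs as a client and initiates the connection into the DMZ
input {
  websocket {
    mode => "client"
    url  => "ws://logstash-dmz.example:3232/"  # placeholder DMZ address
  }
}
```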

Also, it was mentioned earlier that LS does not store data, but this is not strictly true: we use a lot of LS persistent queues as a buffer layer between input and output, and these can (short term) store quite a lot of data!

I haven't yet used these with a websocket output, though; I'd be interested if anyone has any comment on that. From the docs it sounds like, if there weren't any websocket clients connected to the WS output, it would still drain and dump the PQ contents, which is not ideal for me. Technically it seems like it should be possible to have the WS output not "consume" from the PQ if there's no connected client, but I'm not sure that functionality exists right now? Thanks, appreciate any comments!
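For anyone who hasn't used them, persistent queues are enabled per Logstash instance in logstash.yml (or per pipeline in pipelines.yml); the sizes below are just illustrative:

```yaml
# logstash.yml -- buffer events on disk between the input and the output stages
queue.type: persisted                  # default is "memory"
queue.max_bytes: 4gb                   # inputs get back-pressure once the queue is full
path.queue: /var/lib/logstash/queue    # placeholder path on a disk with room to spare
```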

I am not sure of some of your acronyms (I am too new to this game), such as PQ and WS... My understanding is that Beats agents can output to Kafka easily. Though ZooKeeper may still be used within the Kafka cluster, neither the Beats agent nor Logstash needs to talk to it directly (consumer offsets are tracked in Kafka itself), so the yml config is an easy output option to the Kafka cluster. I am just researching how to secure the connections. Beyond that, I saw a blog where someone runs a pair of Logstash servers with keepalive on the protected network. They PULL, i.e. initiate the connection to the Kafka cluster residing in the DMZ. The only thing to consider is deduplication on the Beats side (such as Filebeat handling log rotation), sending a fingerprint to Kafka for each document, so that during a Logstash failover Elasticsearch does not ingest duplicates.
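For the failover piece, what I have in mind (a sketch of the general idea, not the blog's exact setup; note the fingerprint here is computed in Logstash rather than by the Beats agent) is to run both protected-network Logstash instances in the same Kafka consumer group, so Kafka hands each partition to only one of them at a time and rebalances on failover, and to use the fingerprint as the Elasticsearch document ID so a replayed message overwrites itself instead of duplicating:

```
# Both protected-network Logstash instances run this identical pipeline.
input {
  kafka {
    bootstrap_servers => "kafka-dmz-1:9092,kafka-dmz-2:9092"  # placeholder brokers
    topics            => ["beats-logs"]                       # placeholder topic
    group_id          => "protected-pullers"   # shared group => automatic failover
    codec             => json
  }
}

filter {
  # Deterministic per-event ID, so a re-delivered message maps to the same document
  fingerprint {
    source => "message"
    method => "SHA256"
    target => "[@metadata][fp]"
  }
}

output {
  elasticsearch {
    hosts       => ["https://es.internal:9200"]  # placeholder
    document_id => "%{[@metadata][fp]}"
  }
}
```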