High availability Ingest architecture

Context:

My team manages an on-prem Elastic deployment. We have an ECE license.
We had an outage a few weeks ago when our Elasticsearch nodes ran out of disk space because of a problem moving data to the frozen tier.

Since then we have recovered the services, but we lost the data that couldn’t be ingested during the outage.

We are exploring enhancements to the current architecture, and the first thing we came up with is deploying dedicated Logstash nodes with persistent queues (PQ) enabled.

Is that enough of an enhancement to ensure that, during an outage that takes hours to fix, we don’t lose any data that the different data sources are sending to our Elastic deployment (given that the Logstash nodes keep working during the outage)? Or do we need to introduce other elements such as Kafka or Redis for this requirement?

Replaying events that were not indexed by Elasticsearch during the outage is a must.

Without going into lots of detail: the tool of choice for the use case you're describing is often Kafka.
We see many large-scale durable architectures that combine Logstash with Kafka.

PQs could be a solution; they're a smaller, perhaps less scalable, and more tightly coupled one.

You'll have to weigh your own trade-offs, like managing another technology versus gaining decoupling and probably somewhat better durability.
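For reference, enabling PQs is just a few settings in `logstash.yml`; the size cap and path below are illustrative values you'd adapt to your retention needs, not recommendations from this thread:

```
# logstash.yml -- persistent queue settings (illustrative values)
queue.type: persisted                  # default is "memory"
queue.max_bytes: 50gb                  # cap on disk used by each pipeline's queue
path.queue: /var/lib/logstash/queue    # put this on fast, dedicated disks
```

Size the cap against how long an outage you need to survive at your peak ingest rate; once the queue fills, Logstash applies back-pressure to its inputs.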

Those are my thoughts. I'm sure others will have theirs.

I use Logstash + Kafka and recommend this as a good approach; I would not recommend PQs.

The problem with PQs is that they require your Logstash instances to have large, fast disks, which makes them more expensive. Every event needs to be written to disk and then read back before being processed; this can also increase the CPU requirements of the Logstash instances, and it is work that cannot be done in parallel, which may impact your ingestion rate.

Before 9.2, PQs also couldn't compress the data, which used a lot more disk, but it seems you can now compress the data in the PQ.

Kafka is widely used as a message buffer for ingestion flows.
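As a sketch of that decoupled flow, your shippers write to a Kafka topic and Logstash consumes from it; the broker addresses, topic name, and consumer group below are assumptions for illustration, not values from this thread:

```
# Logstash pipeline consuming from Kafka and indexing into Elasticsearch (illustrative)
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"   # assumed broker list
    topics            => ["logs-ingest"]             # assumed topic name
    group_id          => "logstash-indexers"         # assumed consumer group
    auto_offset_reset => "earliest"                  # start from the oldest data if no committed offset
  }
}
output {
  elasticsearch {
    hosts => ["https://es1:9200"]                    # assumed Elasticsearch endpoint
  }
}
```

The nice property for your outage scenario: if Elasticsearch is down, events just accumulate in Kafka (up to the topic's retention), and replaying is a matter of rewinding the consumer group's offsets, e.g. with Kafka's `kafka-consumer-groups.sh --reset-offsets` tool.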

How are you indexing your data? Describing your ingestion flows would make it easier to provide more feedback.
