I'm working on an architecture in which we will ingest transactions into Elasticsearch from our application servers, and want to make absolutely sure that we store 100% of the data. I just want to validate if my theory is correct, or if there is a better way of doing this.
Right now there's a Filebeat agent on every application server; these ship the documents to a Logstash cluster, which forwards them to their specified destination (we have multiple ES stacks, and we also do some filtering and data transformation in Logstash). We have persistent queues enabled on all instances, but as the documentation itself states (and as common sense suggests), they don't protect against e.g. instance termination, disk corruption, etc.
What I imagine I can do is generate a reproducible, unique ID for every document and send each document to multiple Logstash instances (>= 2). During normal operation, the instance that receives the document first writes it to Elasticsearch, and the subsequent ones will (intentionally) fail, so no duplication can happen, at least in theory. (We're currently deciding between manual rollover and a data stream, but either way we most likely have to send documents with the same ID to the same underlying index so duplicate IDs can be detected; I'm not sure that's possible with a data stream, e.g. based on a timestamp field.)
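A reproducible ID like this can be generated in Logstash itself, for example with the fingerprint filter. This is just a sketch: the field names, host, and index name are placeholders for your own transaction fields and cluster.

```text
filter {
  fingerprint {
    # Hash the fields that uniquely identify a transaction.
    # "transaction_id" and "timestamp" are placeholder field names.
    source => ["transaction_id", "timestamp"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][doc_id]"
  }
}

output {
  elasticsearch {
    hosts => ["https://es.example.org:9200"]
    index => "transactions"
    # Use the reproducible hash as the document _id, so the same
    # source event always maps to the same _id.
    document_id => "%{[@metadata][doc_id]}"
  }
}
```

Because the hash is derived only from the event's own fields, every Logstash instance that sees the same event computes the same `_id`.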
Is my theory correct? Does it have any drawbacks or performance impacts, apart from obviously using twice as many resources in the Logstash cluster (which is still far cheaper than a larger ES cluster)? Does it significantly impact ES performance, or is there a better way of achieving this?
I don't think that's possible, since the Logstash instances work independently of each other. How would your Logstash 1 know, for example, that Logstash 2 has already processed message x, so that it can drop it?
One solution (I believe there are many) would be to put a broker such as Kafka or Redis in front of Logstash: your agents deposit the data on the broker, and the Logstash instances consume whatever is deposited there.
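With Kafka, for instance, the Logstash side would be a Kafka input like the sketch below (broker addresses and topic name are made up). Instances sharing a `group_id` form one consumer group, so Kafka delivers each message to exactly one of them:

```text
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics => ["transactions"]
    # All Logstash instances with the same group_id form one consumer
    # group: each message goes to only one member, giving failover
    # without duplicate processing.
    group_id => "logstash-transactions"
  }
}
```

If one Logstash instance dies, Kafka rebalances its partitions to the surviving members of the group, and unacknowledged messages are redelivered.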
To summarize: you will send the same document with the same _id to two Logstash instances. Each instance will send that same document, with the same _id, to the Elasticsearch cluster, but only the first to arrive at the cluster will be successfully written; the other will fail to write (as desired).
You can achieve this by specifying an action of "create", which ensures Elasticsearch will reject the document if it already exists. As this could generate a lot of error messages, you should whitelist this specific error, i.e. skip all 409 errors that are document_already_exists_exception. See the Elasticsearch output plugin page in the Logstash Reference [7.15].
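In the Logstash output config, that looks roughly like the sketch below. Two caveats on my part: the host, index, and `_id` field names are placeholders, and the exact 409 error type depends on the Elasticsearch version (recent versions report version_conflict_engine_exception rather than document_already_exists_exception), so check what your cluster actually returns before whitelisting:

```text
output {
  elasticsearch {
    hosts => ["https://es.example.org:9200"]
    index => "transactions"
    # Placeholder: the reproducible ID generated earlier in the pipeline.
    document_id => "%{[@metadata][doc_id]}"
    # "create" fails with a 409 if a document with this _id already
    # exists, instead of silently overwriting it.
    action => "create"
    # Don't log the expected 409s from the losing duplicate; adjust the
    # error type name to match your ES version.
    failure_type_logging_whitelist => ["version_conflict_engine_exception"]
  }
}
```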
Seems like it should work. Since the second copy of the document should be rejected quite early in the indexing process, I would expect sending the same document twice to have a very small impact on overall cluster performance. But you should test it to be sure.
Thanks for the replies! Streaming messages is indeed a great idea, most likely we will implement this by using Kafka and multiple consumer groups, or simply with a Kafka Elasticsearch sink connector.
Just one thing: the _id field that you generate is only unique at the index level, not at the cluster level, so if you write to an index that has some kind of rollover, you can still get duplicates. The same applies to data streams, since their backing indices roll over as well.