I have a question regarding concurrency management in Logstash.
My setup is the following. I have two Logstash nodes that run the same pipeline, i.e. they use the same pipeline configuration.
They perform exactly the same scheduled operations: they get their input from a common source, make the same processing and then output the data to the same Elasticsearch cluster.
Since a Logstash instance always works in a standalone mode, I would like to avoid that the pipeline is executing the same activity at the same time by both instances.
I was thinking of solving this by implementing a "semaphore" index on Elasticsearch.
The semaphore implemention would work as follows.
In the input section of the Logstash pipeline, I query the semaphore index in ES for any document, with a certain periodicity (by using the elasticsearch input plugin).
If the query returns any documents, then the pipeline drops the current event.
If the query returns nothing, then the pipeline does the following in the filter section:
- creates a semaphore document in the semaphore index
- does the usual elaboration by enriching the input event
- after all activities have been done, remove the semaphore document
Then, it outputs the event to Elasticsearch.
This solution seems a bit cumbersome. Do you know if there is a simpler way to do it?
You could also define a common "document_id" and just have each logstash overwrite the document. Then it won't matter which one fires first or second because if Logstash A went first and it writes a document with ID ABC123 then when Logstash B runs and it uses the same document ID of ABC123 it will just overwrite the changes.
Hi Aquax,
that's not what we would like to do. I will try to explain this use case in a different way.
Pipeline check_agent: this pipeline has a schedule HTTP input which is used to perform request to a specific external agent. This pipeline is stored on a shared NAS.
Logstash_instance_1 (running on host A): running pipeline check_agent Logstash_instance_2 (running on host B): running pipeline check_agent
Since we would like to avoid external agent overhead due to HTTP requests received from both logstash instances, we implemented a kind of concurrency control of the schedule execution using a custom index with a semaphore document (kept update using a specific _id) used to sync both pipeline instances execution, like Alessandro explained in the previous post.
Is this the best approach or are there other better ones?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.