Logstash concurrency - managing multiple LS nodes that share the same pipeline configuration

AlessandroKP · October 21, 2019, 11:08am

Hi all,

I have a question regarding concurrency management in Logstash.

My setup is the following. I have two Logstash nodes that run the same pipeline, i.e. they use the same pipeline configuration.
They perform exactly the same scheduled operations: they get their input from a common source, apply the same processing, and then output the data to the same Elasticsearch cluster.
Since each Logstash instance works standalone, with no coordination between instances, I would like to prevent both instances from executing the same activity at the same time.

I was thinking of solving this by implementing a "semaphore" index on Elasticsearch.

The semaphore implementation would work as follows.
In the input section of the Logstash pipeline, I periodically query the semaphore index in ES for any document (using the elasticsearch input plugin).
If the query returns any documents, then the pipeline drops the current event.
If the query returns nothing, then the pipeline does the following in the filter section:
- creates a semaphore document in the semaphore index
- does the usual processing, enriching the input event
- removes the semaphore document once all activities are done
Then, it outputs the event to Elasticsearch.
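
The steps above can be sketched roughly as follows. This is only an illustration in Python, not Logstash configuration: `InMemorySemaphoreIndex`, `SEMAPHORE_ID` and `run_scheduled_job` are hypothetical names, and the in-memory class stands in for the semaphore index. The sketch merges the "query" and "create" steps into a single create-only write, because with a separate query both nodes could see an empty index and both proceed. Elasticsearch offers comparable create-only semantics when a document is indexed with a fixed `_id` and `op_type=create` (the second writer gets a version conflict).

```python
import threading


class InMemorySemaphoreIndex:
    """Hypothetical stand-in for the semaphore index; not an Elasticsearch client."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def create(self, doc_id, body):
        """Create-only write: returns False if doc_id already exists."""
        with self._lock:
            if doc_id in self._docs:
                return False
            self._docs[doc_id] = body
            return True

    def delete(self, doc_id):
        """Remove the document if present; a missing document is not an error."""
        with self._lock:
            self._docs.pop(doc_id, None)


SEMAPHORE_ID = "check_agent"  # hypothetical fixed ID shared by both nodes


def run_scheduled_job(index, node_name, job):
    """Run job only if this node acquires the semaphore; otherwise skip the run."""
    if not index.create(SEMAPHORE_ID, {"owner": node_name}):
        return None  # another node holds the semaphore: drop this event
    try:
        return job()  # the usual processing / enrichment
    finally:
        index.delete(SEMAPHORE_ID)  # always release, even if job raises
```

Under these assumptions, a node that finds the semaphore taken simply skips that scheduled run. The sketch does not handle a node that crashes while holding the semaphore; that case would need some form of expiry on the semaphore document.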

This solution seems a bit cumbersome. Do you know if there is a simpler way to do it?

Alessandro.

AquaX · October 23, 2019, 8:51pm

You could also define a common "document_id" and have each Logstash instance overwrite the document. Then it doesn't matter which one fires first: if Logstash A writes a document with ID ABC123 and Logstash B then runs with the same document ID ABC123, B simply overwrites A's document.
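
As a rough illustration of this idea (a Python sketch, not Logstash configuration), the ID can be derived deterministically from the event so that both instances compute the same value. The helper names `document_id` and `write`, the field names, and the plain dict standing in for the Elasticsearch index are all assumptions made for the example.

```python
import hashlib
import json


def document_id(event, key_fields):
    """Derive a stable ID from the fields that identify the event."""
    payload = json.dumps({k: event[k] for k in key_fields}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def write(index, event, key_fields):
    """Index the event under its derived ID; a second write overwrites the first."""
    index[document_id(event, key_fields)] = event


# Both instances process the same event and write it to the same index.
index = {}  # hypothetical stand-in for the Elasticsearch index
event = {"agent": "agent-1", "checked_at": "2019-10-23T20:51:00Z", "status": "up"}
write(index, event, ["agent", "checked_at"])  # Logstash A
write(index, event, ["agent", "checked_at"])  # Logstash B overwrites A's document
```

After both writes the index holds a single document. Note that this only deduplicates the output: both instances still read the input and do the processing.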

ea1987 · October 25, 2019, 9:47am

Hi Aquax,
that's not what we would like to do. I will try to explain this use case in a different way.

Pipeline check_agent: this pipeline has a scheduled HTTP input that sends requests to a specific external agent. The pipeline is stored on a shared NAS.

Logstash_instance_1 (running on host A): running pipeline check_agent
Logstash_instance_2 (running on host B): running pipeline check_agent

Since we want to avoid overloading the external agent with HTTP requests from both Logstash instances, we implemented a form of concurrency control over the scheduled execution: a custom index holding a semaphore document (kept updated under a specific _id) that synchronizes the execution of both pipeline instances, as Alessandro explained in the previous post.

Is this the best approach, or are there better ones?

