Scaling logstash nodes

David_Beradze · December 29, 2020, 10:13am

Hello,

We have Oracle (single table) >> Logstash >> Elasticsearch. How can we scale Logstash nodes horizontally without several nodes selecting the same data from the same source (the Oracle table)?

leandrojmp · December 29, 2020, 11:19am

I don't think Logstash scales well in this case.

I'm assuming that you are using the jdbc input to query your Oracle database. This plugin needs to store metadata about the last time it ran, and it keeps that in a file set by the configuration option last_run_metadata_path.

The last_run_metadata_path default value is a file inside the Logstash instance home folder.
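To make the role of that file concrete, here is a minimal Python sketch of the idea. It is not the plugin's code: the JSON layout, the function names and the `id` tracking column are all assumptions made for illustration. It only shows why two instances that each keep their own file end up selecting the same rows.

```python
import json
import os
import tempfile

def load_last_value(path, default=0):
    """Return the tracking value saved by the previous run, or `default`."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)["last_value"]
    except FileNotFoundError:
        return default

def save_last_value(path, value):
    """Persist the tracking value; write to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"last_value": value}, fh)
    os.replace(tmp, path)

def run_once(rows, path):
    """One scheduled run: select rows newer than the saved value, then advance it."""
    last = load_last_value(path)
    new_rows = [r for r in rows if r["id"] > last]
    if new_rows:
        save_last_value(path, max(r["id"] for r in new_rows))
    return new_rows

# Hypothetical demo: `table` stands in for the Oracle table.
state_dir = tempfile.mkdtemp()
table = [{"id": 1}, {"id": 2}]
node_a = os.path.join(state_dir, "node_a.json")
node_b = os.path.join(state_dir, "node_b.json")
print(len(run_once(table, node_a)))  # 2 rows on the first run
print(len(run_once(table, node_a)))  # 0, nothing new for node A
print(len(run_once(table, node_b)))  # 2 again: node B has its own file, so the rows are duplicated
```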

What you could try is to set this value to a network shared folder and mount that folder on all the machines where you will run Logstash. But this would not be enough, because you also have the schedule option, which cannot be the same on every instance: you could end up with two or more instances trying to write to the same file at the same time.

You would also need to use different schedule options for each one of your instances and make sure that they do not overlap.

For example, you could have one instance making a query every minute and the other one making a query every two minutes, and your query would also need to take less than one minute to run.

Maybe this could work, but you would need to test it with different schedule combinations.

David_Beradze · December 29, 2020, 1:12pm

Thank you for your answer. I have to pull the data every second, so I cannot set different schedules.
What about using Linux Keepalived to keep both nodes in sync (it works only for two nodes)? As soon as one node goes down, I can update the pipeline. Do you think this will work?

leandrojmp · December 29, 2020, 1:48pm

Keepalived would make sense if you were sending data to Logstash, but your case is different: it is the Logstash process that is making the query to your database, and it needs to keep track of the last queried data, so you can't have two Logstash nodes making the same query at the same time.

Also, from what I remember, the schedule option in the jdbc input uses the cron format, and the shortest interval you can get is every minute. I don't think you would be able to run it every second.

One solution would be putting a Kafka cluster between your database and Logstash. You would need to use a JDBC connector to put your database data into the Kafka cluster, and then you could use as many Logstash nodes as you want, all of them consuming from Kafka.

When you use the same group_id in the kafka input of every Logstash node, Kafka tracks which messages have already been consumed across the consumer group.
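As a rough picture of what the shared group_id buys you, here is a toy in-memory model in Python. It is not a Kafka client and skips rebalancing, real key hashing and explicit offset commits; the class `ToyTopic` and its methods are made up for this sketch. The point is only that each partition is read by one member of a group, and that the read position is remembered per group.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for one topic with several partitions (not real Kafka)."""

    def __init__(self, partitions):
        self.logs = [[] for _ in range(partitions)]
        # (group_id, partition) -> next offset to read for that group
        self.committed = defaultdict(int)

    def produce(self, key, value):
        # Real Kafka hashes the key bytes; a plain modulo is a simplification.
        self.logs[key % len(self.logs)].append(value)

    def poll(self, group_id, member, members):
        """Return unread messages from the partitions owned by `member` (0-based)."""
        out = []
        for p, log in enumerate(self.logs):
            if p % members != member:  # simple round-robin ownership
                continue
            start = self.committed[(group_id, p)]
            out.extend(log[start:])
            self.committed[(group_id, p)] = len(log)
        return out

topic = ToyTopic(partitions=4)
for i in range(10):
    topic.produce(i, f"row-{i}")

node_a = topic.poll("logstash", member=0, members=2)
node_b = topic.poll("logstash", member=1, members=2)
print(sorted(node_a + node_b) == sorted(f"row-{i}" for i in range(10)))  # True
print(set(node_a) & set(node_b))  # set(): no row was read twice
```

A node polling with a different group_id would start from the beginning and see every row again, which is why all Logstash nodes need the same one.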

But this would also add another layer in your infrastructure.

David_Beradze · December 30, 2020, 5:03pm

Is there any option to sync Oracle with ES that also scales? I need to sync every second, and the data volume is more than 1000 inserts per second.

leandrojmp · December 30, 2020, 5:29pm

Not in a native way. You will need to implement some connector between your database and Logstash to fit your requirements.

The jdbc input plugin has a resolution down to the minute only, and to scale Logstash you need to use other tools or services like HAProxy, Keepalived, Kafka, Redis, etc.

You can, for example, write a Python script to query your Oracle database and send the data to Logstash using one of the available inputs, like tcp, udp or http, or you can send it to a Kafka cluster and configure Logstash to use the kafka input.
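A minimal sketch of such a script could look like the following. Several things here are assumptions: the table `events` with columns `id` and `payload`, the function names, and the use of sqlite3 as a stand-in so the example runs without an Oracle driver. With a real Oracle driver the connection call and the bind-parameter style would differ. On the Logstash side this assumes a tcp input with a json_lines codec.

```python
import json
import socket
import sqlite3

def fetch_new_rows(conn, last_id):
    """Return rows with id greater than last_id as dicts, oldest first."""
    cur = conn.cursor()
    # "?" is sqlite3's placeholder style; an Oracle driver uses a different one.
    cur.execute("SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,))
    return [{"id": row[0], "payload": row[1]} for row in cur.fetchall()]

def to_json_lines(rows):
    """Encode rows as newline-delimited JSON, one event per line."""
    return "".join(json.dumps(r) + "\n" for r in rows).encode("utf-8")

def ship(rows, host, port):
    """Send the rows over TCP, e.g. to a Logstash tcp input (host and port are yours to choose)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(to_json_lines(rows))

# Stand-in database so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

last_id = 1
rows = fetch_new_rows(conn, last_id)
if rows:
    last_id = rows[-1]["id"]  # remember where to resume on the next poll
print(to_json_lines(rows).decode("utf-8"), end="")
```

In a real script you would call `ship(rows, host, port)` inside a loop that sleeps for one second between polls, and you would persist `last_id` somewhere so a restart does not resend everything.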

You can also write an API to query your Oracle database and use the http_poller input to query this API. I think this way you can configure a schedule of every 1s.

If you want to send it directly to Elasticsearch, you can just skip the Logstash part when you query your data; you just need to send it in a format that Elasticsearch will understand.

But either way, it is logic that you will need to implement to fit your requirements.
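For the direct route, one format Elasticsearch understands is the newline-delimited JSON body of the `_bulk` endpoint: an action line followed by the document itself, each terminated by a newline. The sketch below only builds that body. The index name, the `id` field and the function name are assumptions. Reusing the table's primary key as `_id` is one way to make a repeated send overwrite the same document instead of creating a duplicate.

```python
import json

def bulk_body(rows, index, id_field="id"):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for row in rows:
        # Action line: index this document under a deterministic _id.
        action = {"index": {"_index": index, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk body has to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""

# "oracle-events" is a hypothetical index name.
print(bulk_body([{"id": 7, "msg": "x"}], "oracle-events"), end="")
```

The body would then be sent with a POST to `/_bulk` using the `Content-Type: application/x-ndjson` header.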

David_Beradze · December 30, 2020, 7:43pm

The jdbc input plugin does support pulling the data every second; it works.

On a single node, will multiple workers work for pulling the data from jdbc? Will they use the same last_run_metadata_path?

