Scaling Logstash nodes

Hello,

Our pipeline is Oracle (single table) >> Logstash >> Elasticsearch. How can we scale Logstash nodes horizontally without several nodes selecting the same data from the same source (the Oracle table)?

I don't think Logstash scales well in this case.

I'm assuming that you are using the jdbc input to query your Oracle database. This plugin needs to store metadata about the last time it ran, and it stores that in a file set by the configuration option last_run_metadata_path.

The last_run_metadata_path default value is a file inside the Logstash instance home folder.

What you could try is to set this value to a network shared folder and mount that folder on all the machines where you will run Logstash. But this would not be enough, because you also have the schedule option, which cannot be the same on every instance: otherwise you could have two or more instances trying to write to the same file at the same time.

You would also need to use different schedule options for each one of your instances and make sure that they do not overlap.

For example, you could have one instance making a query on every even minute and the other one on every odd minute, and each query would also need to take less than one minute to run.

Maybe this could work, but you would need to test it with different schedule combinations.

Thank you for your answer. I have to pull the data every second, so I cannot set different schedules.
What about using Linux Keepalived to keep both nodes in sync (it works only for two nodes)? As soon as one node goes down, I can update the pipeline. Do you think it will work?

Keepalived would make sense if you were sending data to Logstash, but your case is different: it is the Logstash process that is making the query to your database, and it needs to keep track of the last queried data, so you can't have two Logstash nodes making the same query at the same time.

Also, from what I remember, the schedule option in the jdbc input uses the cron format, and the shortest interval you can get is every minute, so I don't think you would be able to run it every second.

One solution would be to put a Kafka cluster between your database and Logstash. You would need to use a JDBC connector to put your database data into the Kafka cluster, and then you could use as many Logstash nodes as you want, all of them consuming from Kafka.

When you use the same group_id in the kafka input of every Logstash node, Kafka tracks the already consumed messages for the whole consumer group, so each message is handed to only one of the nodes.
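To make the consumer group idea a bit more concrete, here is a toy model in Python. It is not the Kafka client API and does not talk to a broker; the class ToyGroup, the round-robin assignment and the node names are all made up for illustration (real Kafka has several assignment strategies). It only shows the principle: each partition belongs to one consumer in the group, and the offsets are tracked per group, so no message is read twice.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer (simple round-robin)."""
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)


class ToyGroup:
    """Toy stand-in for a consumer group; NOT the real Kafka API."""

    def __init__(self, log, consumers):
        self.log = log  # partition number -> list of messages
        # One committed offset per partition, shared by the whole group.
        self.offsets = {partition: 0 for partition in log}
        self.assignment = assign_partitions(list(log), consumers)

    def poll(self, consumer, max_records=10):
        """Return unread messages from the partitions owned by `consumer`."""
        records = []
        for partition in self.assignment.get(consumer, []):
            start = self.offsets[partition]
            batch = self.log[partition][start:start + max_records]
            records.extend(batch)
            # Committing the offset means nobody in the group re-reads these.
            self.offsets[partition] = start + len(batch)
        return records
```

With four partitions and two hypothetical nodes, "logstash-1" and "logstash-2", each node gets two partitions, and a second poll by the same node returns nothing until new messages arrive.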

But this would also add another layer in your infrastructure.

Is there any option to sync Oracle with ES that can scale? I need to sync every second, and the data rate is more than 1000 inserts per second.

Not in a native way; you will need to implement some connector between your database and Logstash to fit your requirements.

The jdbc input plugin has a resolution down to the minute only, and to scale Logstash you need to use other tools or services such as HAProxy, Keepalived, Kafka, Redis, etc.

You can, for example, write a Python script to query your Oracle database and send the data to Logstash using one of the available inputs, like tcp, udp or http, or you can send it to a Kafka cluster and configure Logstash to use the kafka input.
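A minimal sketch of such a script, under several assumptions: sqlite3 is used here only as a stand-in so the example runs anywhere (for Oracle you would swap in an Oracle driver and its SQL syntax), the table events with columns id and payload is hypothetical, and Logstash is assumed to have a tcp input with the json_lines codec listening on localhost:5000. The script remembers the highest id it has seen, which plays the same role as the last-run metadata of the jdbc input.

```python
import json
import socket
import sqlite3  # stand-in for an Oracle driver, so the sketch is runnable


def fetch_new_rows(conn, last_id, batch_size=1000):
    """Select only rows with an id greater than the last one already sent."""
    cur = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    )
    return cur.fetchall()


def to_json_lines(rows):
    """Serialize rows as newline-delimited JSON, one event per line."""
    return "".join(
        json.dumps({"id": row[0], "payload": row[1]}) + "\n" for row in rows
    ).encode("utf-8")


def send_to_logstash(data, host="localhost", port=5000):
    """Send the bytes to a Logstash tcp input (host and port are assumptions)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(data)


def poll_once(conn, last_id, send=send_to_logstash):
    """Run one polling cycle and return the new high-water mark."""
    rows = fetch_new_rows(conn, last_id)
    if rows:
        send(to_json_lines(rows))
        last_id = rows[-1][0]
    return last_id
```

You would call poll_once in a loop with a one-second sleep and persist last_id somewhere durable between restarts. Only one such poller should own the table; the scaling then happens behind it, in Logstash or Kafka.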

You can also write an API to query your Oracle database and use the http_poller input to query this API; I think this way you can configure a schedule of every 1s.

If you want to send it directly to Elasticsearch, you can just skip the Logstash part when you query your data; you just need to send it in a format that Elasticsearch will understand.
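As a rough illustration of "a format that Elasticsearch will understand", the sketch below builds a request body for the Elasticsearch _bulk endpoint, which expects newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. The function name build_bulk_body, the index name and the id field are assumptions for the example. Using the row's primary key as _id means a row that is sent twice overwrites itself instead of creating a duplicate.

```python
import json


def build_bulk_body(index, docs, id_field="id"):
    """Build an NDJSON body for POST /_bulk from a list of dicts."""
    lines = []
    for doc in docs:
        # Action line: index this document under a deterministic _id.
        lines.append(
            json.dumps({"index": {"_index": index, "_id": str(doc[id_field])}})
        )
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent with the Content-Type header application/x-ndjson, using whatever HTTP client you prefer.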

But either way, it is logic that you will need to implement to fit your requirements.

The jdbc input plugin does support pulling the data every second; it works.

On a single node, will multiple workers pull the data from jdbc in parallel? Will they use the same last_run_metadata_path?