Pull data from another elasticsearch cluster using logstash

Kosodrom · January 10, 2022, 8:49pm

Hi folks,

I have a use case where I need to pull data from an external Elasticsearch cluster, with the following limitation: opening a connection to my internal Elasticsearch cluster is not allowed (due to compliance regulations). That means I cannot just add another output to the external Logstash to forward data to my cluster.

I am considering the following options:

Run a pipeline in my internal Logstash that pulls data using the Elasticsearch input plugin. The problem is that this plugin has no logic to track which data has already been shipped and which has not. I would end up fetching whole indices very frequently to make sure I have not missed any data, which can cause problems given the size of those indices.
Run a pipeline in my internal Logstash that pulls data using the TCP input plugin in client mode from the external Logstash, which runs a TCP output plugin in server mode. The problem here is that the TCP input plugin "destroys" the original document format, which then has to be repaired with filters.
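
For reference, the second option could be sketched roughly as below. This is only an illustration, not a tested setup: the host name external-logstash.example and port 6000 are made up, and the json_lines codec on both sides is an assumption about how the event structure might be carried across. The TCP input may still add or overwrite a few fields of its own, so some filtering could remain necessary.

    # External Logstash (hypothetical listen address and port):
    # serve events to connecting clients as newline-delimited JSON
    output {
      tcp {
        mode  => "server"
        host  => "0.0.0.0"
        port  => 6000
        codec => json_lines
      }
    }

    # Internal Logstash (separate pipeline on the internal side):
    # open the connection outward and decode each JSON line into an event
    input {
      tcp {
        mode  => "client"
        host  => "external-logstash.example"
        port  => 6000
        codec => json_lines
      }
    }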

So you see, I am not really satisfied with either of these options :(.

Maybe one of you has another idea?

Many thanks for any input.

FALEN · January 11, 2022, 12:53pm

You can add a query to the Elasticsearch input plugin to limit the ingested data, for example:

    input {
      # Read all documents from Elasticsearch matching the given query
      elasticsearch {
        hosts => "localhost"
        query => '{ "query": { "match": { "statuscode": 200 } }, "sort": [ "_doc" ] }'
      }
    }

Or you can use a range query with gte/lte to limit the data by time.
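
A time-based variant might look like the sketch below. The index pattern logs-*, the @timestamp field, the 15-minute window, and the matching cron schedule are all assumptions chosen for illustration. Note that a fixed window like this does not remember what was already read, so a skipped or failed run can leave a gap.

    input {
      elasticsearch {
        hosts    => "localhost"
        # Hypothetical index pattern; adjust to the indices you want to read
        index    => "logs-*"
        # Only fetch documents whose @timestamp falls in the last 15 minutes
        query    => '{ "query": { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }, "sort": [ "_doc" ] }'
        # Run the query every 15 minutes (cron syntax)
        schedule => "*/15 * * * *"
      }
    }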

Kosodrom · January 13, 2022, 9:16am

Thank you very much for the response. Unfortunately, I am not able to limit my request in such a way. I always need ALL documents, because the goal is to reliably collect all logs from the "external" Elasticsearch cluster in the "internal" one.

I could also limit my request by time frame, but then I risk losing data during a network outage, because the Elasticsearch input plugin does not know which documents it has already read and which it has not.

Just in case someone is looking at a similar use case:

I think I have now come up with the following solution:

With Filebeat, I will add a field to all events: document_read: false
In Logstash, I will run the Elasticsearch input and fetch only those documents that have document_read: false
In Logstash, I will have a filter that sets document_read: true
I will have one output that sends the document to the internal cluster
I will have a second output that updates the document by its _id in the external cluster, so that it has document_read: true there as well.
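
The steps above could be sketched as a single Logstash pipeline along the following lines. This is an untested illustration under several assumptions: the host names, the logs-* index pattern, the [@metadata][doc] target, and the one-minute schedule are invented for the example, authentication and TLS settings are left out, and Filebeat is assumed to have already written document_read: false into every document.

    input {
      elasticsearch {
        # Hypothetical address of the external cluster
        hosts          => ["https://external-es.example:9200"]
        index          => "logs-*"
        # Fetch only documents that have not been marked as read yet
        query          => '{ "query": { "term": { "document_read": false } }, "sort": [ "_doc" ] }'
        # Keep _index and _id of each hit so the outputs can address the same document
        docinfo        => true
        docinfo_target => "[@metadata][doc]"
        # Poll once per minute (cron syntax)
        schedule       => "* * * * *"
      }
    }

    filter {
      # Mark the event as read, then turn the string into a real boolean
      mutate { replace => { "document_read" => "true" } }
      mutate { convert => { "document_read" => "boolean" } }
    }

    output {
      # Copy the document into the internal cluster under its original _id
      elasticsearch {
        hosts       => ["https://internal-es.example:9200"]
        index       => "%{[@metadata][doc][_index]}"
        document_id => "%{[@metadata][doc][_id]}"
      }
      # Update the source document in the external cluster so it is skipped next time
      elasticsearch {
        hosts       => ["https://external-es.example:9200"]
        index       => "%{[@metadata][doc][_index]}"
        document_id => "%{[@metadata][doc][_id]}"
        action      => "update"
      }
    }

Reusing the original _id in the internal output means that a document fetched twice simply overwrites its earlier copy instead of creating a duplicate. With action => "update", the whole event is sent as a partial document, so fields that Logstash adds (such as @version) would also be written back to the external document unless they are removed first.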

This solution has the following advantages:

It introduces a way of "remembering" which documents you have already read and which you have not
It reduces the number of documents requested from the Elasticsearch cluster
Network outages cause no problems. It just takes longer to fetch all documents
If something is wrong with your pipeline, no outputs are executed and the documents are not updated to document_read: true. So you can be sure that you will not lose any data once everything is fixed.

Does anyone have an opinion on that?

system · February 10, 2022, 9:16am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
