Pull data from another Elasticsearch cluster using Logstash

Hi folks,

I have a use case where I need to pull data from an external Elasticsearch cluster with the following limitation: it is not allowed to open a connection to my internal Elasticsearch cluster (due to compliance regulations). That means I cannot just add another output in the external Logstash to forward data to my cluster.

I am considering the following options:

  1. Run a pipeline in my internal Logstash to pull data using the Elasticsearch input plugin. The problem is that this plugin has no logic to track which data has already been shipped and which has not. I would end up fetching whole indices very frequently to make sure I have not missed any data, which can lead to problems given the size of those indices.

  2. Run a pipeline in my internal Logstash to pull data using the TCP input plugin in client mode, connecting to the external Logstash running the TCP output plugin in server mode. The problem here is that the TCP input plugin "destroys" the original document format, which then needs to be rebuilt with filters.
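For reference, option 2 would look roughly like this (hostnames and the port are assumptions). Note that with the json_lines codec the event structure can be largely preserved over the wire, which mitigates the format problem somewhat:

 # External Logstash (server side, assumed config)
     output {
       tcp {
         host => "0.0.0.0"
         port => 5044              # placeholder port
         mode => "server"
         codec => json_lines       # serialize events as JSON, one per line
       }
     }

 # Internal Logstash (client side, assumed config)
     input {
       tcp {
         host => "external-logstash.example.com"   # placeholder host
         port => 5044
         mode => "client"
         codec => json_lines       # parse the JSON back into event fields
       }
     }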

So you see I am not really satisfied with either of these options :(.

Maybe one of you has other ideas?

Many thanks for any input.

You can add a query to the Elasticsearch input plugin; this way you can limit the ingested data, for example:

 input {
      # Read all documents from Elasticsearch matching the given query
      elasticsearch {
        hosts => "localhost"
        query => '{ "query": { "match": { "statuscode": 200 } }, "sort": [ "_doc" ] }'
      }
    }
Or you can use a range query with gte/lte to limit by time.
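For instance, a time-based range query might look like this (the @timestamp field name is an assumption based on typical Beats mappings):

 input {
      elasticsearch {
        hosts => "localhost"
        # Only fetch documents from the last hour
        query => '{ "query": { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }, "sort": [ "_doc" ] }'
      }
    }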


Thank you very much for the response. Unfortunately I am not able to limit my request in that way. I always need ALL documents, because the goal is to collect all logs from the "external" Elasticsearch in the "internal" Elasticsearch in a reliable way.

I could also limit my request by time frame, but then I risk losing data during a network outage, because the Elasticsearch input plugin does not know which documents it has already read and which it has not.

Just in case someone is looking into a similar use case:

I think I have now come up with the following solution:

  1. With Filebeat I will add a field to all events: document_read: false
  2. In Logstash I will run an Elasticsearch input and fetch only those documents that have document_read: false
  3. In Logstash I will have a filter that sets document_read: true
  4. I will have one output to send the document to the internal cluster
  5. I will have a second output to update the document by its _id in the external cluster, so that it has document_read: true.
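A minimal sketch of such a pipeline (hostnames and index patterns are placeholders; docinfo is enabled so the original _index and _id are available for the update):

 input {
      elasticsearch {
        hosts => "https://external-cluster:9200"    # placeholder host
        index => "filebeat-*"                       # placeholder index pattern
        query => '{ "query": { "match": { "document_read": false } }, "sort": [ "_doc" ] }'
        docinfo => true
        docinfo_target => "[@metadata][doc]"
        schedule => "*/5 * * * *"                   # poll every 5 minutes
      }
    }
    filter {
      # Mark the event as read (stored as a string; adjust if your mapping is a strict boolean)
      mutate { replace => { "document_read" => "true" } }
    }
    output {
      # 1) Ship the document to the internal cluster
      elasticsearch {
        hosts => "https://internal-cluster:9200"    # placeholder host
        index => "%{[@metadata][doc][_index]}"
      }
      # 2) Update the same document in the external cluster by its original _id
      elasticsearch {
        hosts => "https://external-cluster:9200"
        index => "%{[@metadata][doc][_index]}"
        document_id => "%{[@metadata][doc][_id]}"
        action => "update"
      }
    }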

This solution allows you to:

  1. Introduce a way of "remembering" which documents have already been read and which have not
  2. Reduce the number of documents requested from the Elasticsearch cluster
  3. Avoid problems in case of network outages; it will just take longer to fetch all documents
  4. In case something is wrong with your pipeline, no outputs will be executed and the documents will not be updated to document_read: true. So you can be sure that, once everything is fixed, you will not lose any data.

Does anyone have any opinions on that?