I have a use case where I need to pull data from an external Elasticsearch cluster, with the following limitation: the external side is not allowed to open a connection to my internal Elasticsearch cluster (due to compliance regulations). That means I cannot simply add another output to the external Logstash to forward data to my cluster.
I am considering the following options:
Run a pipeline in my internal Logstash that pulls data using the elasticsearch input plugin (roughly as in the sketch after these options). The problem is that this plugin has no logic to keep track of which data has already been shipped and which has not. I would end up fetching whole indices very frequently to make sure I have not missed any data, which can cause problems given the size of those indices.
Run a pipeline in my internal Logstash that pulls data using the tcp input plugin in client mode from the external Logstash running a tcp output plugin in server mode. The problem here is that the tcp input plugin "destroys" the previous document format, which then has to be rebuilt with filters.
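For reference, option 1 would look roughly like the sketch below. This is only a minimal illustration; the hosts, index pattern, and schedule are placeholders I made up. Note that nothing in it remembers what was already fetched, so every scheduled run pulls the full result set again:

```
input {
  elasticsearch {
    hosts    => ["https://external-es:9200"]   # placeholder host
    index    => "filebeat-*"                   # placeholder index pattern
    query    => '{ "query": { "match_all": {} } }'
    schedule => "*/5 * * * *"                  # re-runs the FULL query every 5 minutes
  }
}

output {
  elasticsearch {
    hosts => ["https://internal-es:9200"]      # placeholder host
  }
}
```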
So you see, I am not really satisfied with either of these options :(.
Thank you very much for the response. Unfortunately I am not able to limit my request in that way. I always need ALL documents, because the goal is to reliably collect all logs from the "external" Elasticsearch in the "internal" Elasticsearch.
I could also limit my request by time frame, but then I risk losing data during a network outage, because the elasticsearch input plugin does not know which documents it has already read and which it has not.
Just in case someone is looking for a similar use case:
I think I have now come up with the following solution (sketched in the configs after the steps):
With Filebeat I add a field to all events: document_read: false
In Logstash I run an elasticsearch input and fetch only those documents that have document_read: false
In Logstash I have a filter that sets document_read: true
I have one output that sends the document to the internal cluster
I have a second output that updates the document by its _id in the external cluster, so that it gets document_read: true.
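A minimal sketch of what this looks like, assuming Filebeat writes directly to the external cluster and the internal Logstash can reach both clusters. All hosts, paths, and index names are placeholders, and option names may vary slightly between plugin versions.

```
# filebeat.yml (sketch): mark every event as unread at the source
filebeat.inputs:
  - type: log
    paths:
      - /var/log/*.log        # placeholder path
    fields:
      document_read: false
    fields_under_root: true
```

```
# Pipeline on the internal Logstash (sketch)
input {
  elasticsearch {
    hosts          => ["https://external-es:9200"]   # placeholder
    index          => "filebeat-*"                   # placeholder
    query          => '{ "query": { "term": { "document_read": false } } }'
    schedule       => "* * * * *"                    # poll once a minute
    docinfo        => true                           # keep _index and _id of the source doc
    docinfo_target => "[@metadata][doc]"
  }
}

filter {
  # Flip the flag before the event is written anywhere.
  # Elasticsearch coerces the string "true" onto a boolean-mapped field.
  mutate { replace => { "document_read" => "true" } }
}

output {
  # 1) Ship the document to the internal cluster. Reusing the original _id
  #    makes the write idempotent in case the same document is fetched twice.
  elasticsearch {
    hosts       => ["https://internal-es:9200"]      # placeholder
    document_id => "%{[@metadata][doc][_id]}"
  }
  # 2) Update the original document in the external cluster so the next
  #    scheduled query no longer matches it.
  elasticsearch {
    hosts       => ["https://external-es:9200"]      # placeholder
    index       => "%{[@metadata][doc][_index]}"
    action      => "update"
    document_id => "%{[@metadata][doc][_id]}"
  }
}
```

One thing to be aware of with this sketch: since the input polls on a schedule, a document can be fetched a second time if the next query runs before the update in step 2 is visible, which is why the internal output above also writes with the original _id.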
This solution allows you to:
Introduce a way of "remembering" which documents you have already read and which you have not
Reduce the number of documents requested from the external Elasticsearch cluster
Survive network outages without problems; it will just take longer to fetch all documents
In case something is wrong with your pipeline, no outputs will be executed and the documents will not be updated to document_read: true, so you can be sure that once everything is fixed you will not have lost any data.