How does the elasticsearch input plugin work in Logstash?


I am trying to transfer data from one AWS ES cluster to another AWS ES cluster, and in this case I am going to use the schedule option to pull the data from the input on every interval.

I would like to understand how the elasticsearch input plugin really works here. On every interval, does it read all of the data (from the first document to the latest document) from the source ES cluster, or are only modified documents selected and transferred to the other ES cluster? I am going to have a huge amount of data on my source ES cluster, so it is really important for me to understand how this input plugin reads and transfers data.

Also, is there any debug option available for the input plugins in Logstash?

Sample Logstash configuration file:


    input {
      elasticsearch {
        hosts => "source ES cluster endpoint"
        index => "v-demo-*"
        schedule => "* * * * *"
        user => 'xxxxxx'
        password => 'xxxxx'
        size => 500
        scroll => "5m"
        docinfo => true
      }
    }

    output {
      elasticsearch {
        hosts => "Destination ES cluster endpoint"
        index => "%{[@metadata][_index]}"
        document_id => "%{[@metadata][_id]}"
        ssl => true
        user => 'xxxxx'
        password => 'xxxxxx'
      }
    }

As far as I know, the default query will read the entire index. That means you will get another copy of the entire index every interval.
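As I understand it (worth checking the plugin docs for your version), the default query is essentially a match-all sorted by _doc, which is why every scheduled run walks the whole index. Roughly:

    input {
      elasticsearch {
        hosts => "source ES cluster endpoint"
        index => "v-demo-*"
        # This is approximately the default: no filter at all, so every
        # document in the index is scrolled through on every scheduled run.
        query => '{ "sort": [ "_doc" ] }'
      }
    }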

There has been an issue open for years suggesting that there should be some kind of "sincedb" feature, or functionality similar to the jdbc input's tracking column. Nobody has implemented it. Feel free to submit a PR that does so; do not expect Elastic to do so. (I am a Logstash user. I in no way speak for Elastic.)

In a past life I consumed huge daily indexes into different, filtered and enriched daily indexes by running Logstash once a day to process the previous day's index. That worked well because elasticsearch was already segmenting the data by day, so eating the entire previous day's index was exactly what I wanted.

If you need to run a query on a schedule you may be able to use date math to limit the data returned by the query, and then use the [@metadata][_id] field to limit duplication. For example, if you run the query once a minute you could have it limit the date range to the last minute, or the last two minutes (meaning half of the documents get overwritten), or the last 10 minutes (meaning 90% get overwritten). How much do you care about missing data? If 0% loss is the only acceptable answer it is a really, really hard question.
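A minimal sketch of what that could look like, assuming your documents carry an @timestamp field (the field name and the two-minute window are just illustrative, not something the plugin requires):

    input {
      elasticsearch {
        hosts => "source ES cluster endpoint"
        index => "v-demo-*"
        schedule => "* * * * *"
        # Only pull documents whose @timestamp falls in the last two minutes.
        # Anything re-read because of the overlap gets overwritten by id downstream.
        query => '{ "query": { "range": { "@timestamp": { "gte": "now-2m", "lt": "now" } } }, "sort": [ "_doc" ] }'
        size => 500
        scroll => "5m"
        docinfo => true
      }
    }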

Thanks for the response @Badger.
With a small set of indexes it is okay to read the entire index set again and again, but with larger data I don't think this is a good idea.
Then how does the elasticsearch output plugin work? Since the input plugin is reading the entire index every time, is the output plugin always overwriting everything?
One more observation with the plugin: if I delete an index in the source elasticsearch cluster, the input plugin reads the remaining indexes, but that particular index is not deleted from the destination elasticsearch. Based on this it looks like the output plugin doesn't overwrite everything every time; it looks like it only handles modified or new documents, but I don't see any such confirmation in the output plugin documentation.
Do you have any idea?

The elasticsearch output will overwrite a document if it has the same document id. That's why you would have the elasticsearch input set the metadata and have the elasticsearch output use the document id from that metadata.

The elasticsearch output indexes documents. It has no idea that it is re-indexing documents, so if a document is not sent to it then it has no way of knowing that it should delete it. It does not have any broader context for an event, it just handles them one at a time.
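To make that concrete, here is a minimal sketch of the output side (same metadata fields as your config above). The default action is "index", so each event is written, or re-written, under its id; a document would only be deleted if you explicitly sent an event with action => "delete", which the elasticsearch input never produces for documents that no longer exist in the source.

    output {
      elasticsearch {
        hosts => "Destination ES cluster endpoint"
        index => "%{[@metadata][_index]}"
        document_id => "%{[@metadata][_id]}"
        # "index" is the default action: create the document if the id is new,
        # overwrite it if the id already exists. Deletes never happen implicitly.
        action => "index"
      }
    }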
