How To: Back-fill Elasticsearch without losing data


(Charles Patton) #1

I want to import from my existing log file as a new file input like so to back-fill my old data:

input {
  file {
    path => [ '/path_local_to_es/file' ]
    start_position => 'beginning'
  }
}

AND

Continue to capture the tail of the very same file as it's actively being written, shipping it through a redis messaging queue...

On Data Source:

input {
    file {
        path => [ '/path_local_to_source/file' ]
        start_position => 'end'
    }
}
filter {
    ...
}
output {
    redis {
         host => 'queue-to-elasticsearch'
         data_type => 'list'
         key => 'logstash'
    }
}

On Elasticsearch:

input {
    redis {
        host => '127.0.0.1'
        data_type => 'list'
        key => 'logstash'
    }
}
output {
    elasticsearch {
        host => '127.0.0.1'
    }
}

How do I combine the two in a way that lets me back-fill all of the data I want from the file but not duplicate data that's already being ingested by the redis queue? I'm ok with stopping the queue for a moment but even then, how would I prevent missing data between when my local copy of the log file reaches EOF and the redis queue is turned back on?

Thanks.


(João Duarte) #2

The easiest way to support frequent back-filling is to control the document ids so that each document has a unique id before going to elasticsearch, either extracted from the source or computed from the document data itself (if it is unique enough).
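
If the source data doesn't carry a natural id, one way to compute one from the event itself is Logstash's fingerprint filter. This is only a minimal sketch, not from the original post: the "message" source field, the hash method, and the key are assumptions you'd adapt to your own data, and the same filter has to run in both the back-fill pipeline and the live redis pipeline so both produce identical ids for the same log line.

filter {
    fingerprint {
        # hash the raw log line; assumes 'message' uniquely identifies an event
        source => "message"
        target => "[@metadata][fingerprint]"
        method => "SHA256"
        key => "any-static-key"
    }
}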

If each document has a unique document id, it can be passed into the elasticsearch output like this: document_id => "%{[unique_id_field]}"
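
For example, using the fingerprint computed above (the field name is the same assumed placeholder), the elasticsearch output from the original post would look something like:

output {
    elasticsearch {
        host => '127.0.0.1'
        # events that arrive twice (file back-fill + redis tail) share an id,
        # so the second write overwrites the first instead of duplicating it
        document_id => "%{[@metadata][fingerprint]}"
    }
}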

By doing this, a single document can be indexed into elasticsearch more than once without creating duplicates, because the existing copy will simply be overwritten.
