A working example of how to import a Wikipedia dump with Logstash using the es_bulk format

I'm writing a Logstash configuration file to import a Wikipedia dump, found at https://dumps.wikimedia.org/other/cirrussearch/current/

The dumps are in the es_bulk format, i.e. one line with the action and the id of the document, followed by a line containing the actual JSON source of the document.
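Concretely, every document takes two physical lines, roughly like this (the values below are made up for illustration, not copied from the actual file):

    {"index":{"_type":"page","_id":"1234"}}
    {"title":"Some page","namespace":0,"text":"The page body ..."}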

I've been switching codecs to try to make this work: the json codec ingests each line as a separate document, and the es_bulk codec causes a crash. I can't for the life of me work out what a multiline configuration would have to look like to handle this.

This is my conf right now:

input {
    file {
        path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
        mode => "read"
        codec => "json"
        start_position => "beginning"
        file_completed_action => "log"
        file_completed_log_path => "/home/projects/wiki-load/log.txt"
    }
}
filter {
    json {
        source => "message"
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "svwiki-20201012"
        document_type => "page"
    }
    stdout {
        codec => rubydebug { metadata => false }
    }
}

I need to capture the header line and the document line together, keeping the original structure and unique id of the document. Any suggestions here?
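For reference, the closest I have gotten to the multiline idea is the untested sketch below. It rests on a few assumptions: that the header line always starts with {"index", that the file input's read mode can combine gzip handling with a multiline codec, and that the document id sits at index._id in the header. The ruby filter splits the glued pair back apart, pulls the id out of the header, and hands the source line to the json filter.

    input {
        file {
            path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
            mode => "read"
            start_position => "beginning"
            file_completed_action => "log"
            file_completed_log_path => "/home/projects/wiki-load/log.txt"
            codec => multiline {
                # a line starting with {"index" belongs together with the line that follows it
                pattern => '^\{"index"'
                negate => false
                what => "next"
                max_lines => 2
            }
        }
    }
    filter {
        # split the glued pair: first line is the bulk action header, second is the document source
        ruby {
            init => 'require "json"'
            code => '
                header, source = event.get("message").split("\n", 2)
                id = JSON.parse(header).dig("index", "_id")
                event.set("[@metadata][doc_id]", id) if id
                event.set("source_json", source)
            '
        }
        json {
            source => "source_json"
            remove_field => ["message", "source_json"]
        }
    }
    output {
        elasticsearch {
            hosts => ["localhost:9200"]
            index => "svwiki-20201012"
            document_type => "page"
            document_id => "%{[@metadata][doc_id]}"
        }
        stdout {
            codec => rubydebug { metadata => false }
        }
    }

If the multiline codec turns out not to work on the gzipped input in read mode, my plan B would be to gunzip the dump first and point path at the plain .json file, keeping the rest of the pipeline the same.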
