I'm writing a Logstash configuration file for importing a Wikipedia dump from https://dumps.wikimedia.org/other/cirrussearch/current/
The dumps are in the es_bulk format, i.e. one line with the action and ID of the document, followed by a line containing the actual JSON data.
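For reference, a pair of lines from the dump looks roughly like this (abridged and from memory, so treat the exact fields as illustrative):

{"index":{"_type":"page","_id":"251"}}
{"namespace":0,"title":"Sida kuu","text":"...","timestamp":"..."}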
I've been switching codecs to make this work: the json codec ingests each line as a separate document, and the es_bulk codec crashes outright. I can't for the life of me work out what a multiline configuration would look like to handle this.
This is my conf right now:
input {
  file {
    path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
    mode => "read"
    codec => "json"
    start_position => "beginning"
    file_completed_action => "log"
    file_completed_log_path => "/home/projects/wiki-load/log.txt"
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "svwiki-20201012"
    document_type => "page"
  }
  stdout {
    codec => rubydebug { metadata => false }
  }
}
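My best guess so far is to swap the json codec for a multiline codec that glues each action line onto the document line that follows it. Something like this sketch, where the pattern is my own assumption (untested) about matching the opening of the action line:

input {
  file {
    path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
    mode => "read"
    start_position => "beginning"
    # Action lines start with {"index"; what => "next" appends each
    # matching line to the line after it, so the header and the document
    # end up in one event, separated by a newline.
    codec => multiline {
      pattern => '^\{"index"'
      what => "next"
    }
    file_completed_action => "log"
    file_completed_log_path => "/home/projects/wiki-load/log.txt"
  }
}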
Whatever the approach, I need to capture the header line and the document line together, keeping the original structure and the unique ID of the document. Any suggestions here?
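If a multiline join like the one above works, my rough idea for the filter and output side is the following sketch (again untested; I'm assuming a ruby filter can split the joined event and that stashing the ID in [@metadata][_id] is an acceptable way to feed document_id):

filter {
  # message should now hold "<action line>\n<document line>"
  ruby {
    code => '
      header, doc = event.get("message").split("\n", 2)
      # Pull the _id out of the action line, e.g. {"index":{"_type":"page","_id":"251"}}
      meta = JSON.parse(header)["index"] rescue {}
      event.set("[@metadata][_id]", meta["_id"]) if meta["_id"]
      # Leave only the document itself for the json filter below
      event.set("message", doc)
    '
  }
  json {
    source => "message"
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "svwiki-20201012"
    document_type => "page"
    # Reuse the original ID extracted from the action line
    document_id => "%{[@metadata][_id]}"
  }
}

But I may well be overcomplicating this, so corrections are welcome.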