I'm writing a Logstash configuration file for importing a Wikipedia dump from https://dumps.wikimedia.org/other/cirrussearch/current/
The dumps are in the es_bulk format, i.e. one line with the action and ID of the document, followed by a line containing the actual JSON data.
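For reference, a pair of lines from the dump looks roughly like this (abridged and from memory, so treat the exact fields as illustrative):

{"index":{"_type":"page","_id":"251"}}
{"namespace":0,"title":"Sida kuu","text":"...","timestamp":"..."}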
I've been switching codecs to make this work: the json codec ingests each line as a separate document, and the es_bulk codec crashes outright. I can't for the life of me work out what a multiline configuration would look like to handle this.
This is my conf right now:
input {
  file {
    path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
    mode => "read"
    codec => "json"
    start_position => "beginning"
    file_completed_action => "log"
    file_completed_log_path => "/home/projects/wiki-load/log.txt"
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "svwiki-20201012"
    document_type => "page"
  }
  stdout {
    codec => rubydebug { metadata => false }
  }
}
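My best guess so far is to swap the json codec for a multiline codec that glues each action line onto the document line that follows it. Something like this sketch, where the pattern is my own assumption (untested) about matching the opening of the action line:

input {
  file {
    path => "/home/projects/wiki-load/swwikibooks-20201019-cirrussearch-general.json.gz"
    mode => "read"
    start_position => "beginning"
    # Action lines start with {"index"; what => "next" appends each
    # matching line to the line after it, so the header and the document
    # end up in one event, separated by a newline.
    codec => multiline {
      pattern => '^\{"index"'
      what => "next"
    }
    file_completed_action => "log"
    file_completed_log_path => "/home/projects/wiki-load/log.txt"
  }
}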
Whatever the approach, I need to capture the header line and the document line together, keeping the original structure and the unique ID of the document. Any suggestions here?
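If a multiline join like the one above works, my rough idea for the filter and output side is the following sketch (again untested; I'm assuming a ruby filter can split the joined event and that stashing the ID in [@metadata][_id] is an acceptable way to feed document_id):

filter {
  # message should now hold "<action line>\n<document line>"
  ruby {
    code => '
      header, doc = event.get("message").split("\n", 2)
      # Pull the _id out of the action line, e.g. {"index":{"_type":"page","_id":"251"}}
      meta = JSON.parse(header)["index"] rescue {}
      event.set("[@metadata][_id]", meta["_id"]) if meta["_id"]
      # Leave only the document itself for the json filter below
      event.set("message", doc)
    '
  }
  json {
    source => "message"
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "svwiki-20201012"
    document_type => "page"
    # Reuse the original ID extracted from the action line
    document_id => "%{[@metadata][_id]}"
  }
}

But I may well be overcomplicating this, so corrections are welcome.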