Indexing office/PDF documents from directory

vinit2710 · July 19, 2017, 2:08pm

I am new to Elastic and am looking for a connector to ingest Office/PDF documents from the file system.
Older versions had a river connector, which is no longer supported (I assume).
I tried using Logstash for this, but it is not working.
Any suggestions?
The following is my Logstash configuration:

input {
  file {
    path => "D:/elastic/data/resumes/."
    codec => plain { charset => "ISO-8859-1" }
    start_position => "beginning"
  }
}
filter {
  ruby {
    init => "require 'base64'"
    code => "event.set('data', Base64.encode64(event.get('message')))"
  }
  mutate {
    remove_field => ["message"]
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => ["192.168.1.2:9200"]
    index => "resumedata"
    document_type => "resumes"
    pipeline => "attachment"
  }
  stdout { codec => rubydebug }
}

dadoonet · July 19, 2017, 3:38pm

You might want to give the FSCrawler project a try.
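
As background on why the Logstash configuration in the question probably does not work: the file input emits one event per line, so the ruby filter Base64-encodes individual lines of text rather than the whole binary document, and an attachment ingest pipeline generally expects the complete file in a single Base64 field. A minimal Python sketch of that idea follows. It only builds the request and sends nothing. The function names are hypothetical; the index, type, pipeline, and `data` field names are taken from the question; the URL shape is an assumption that depends on the Elasticsearch version in use.

```python
import base64
import json
from pathlib import Path


def build_attachment_doc(path):
    """Read the whole file as bytes and Base64-encode it into one field."""
    p = Path(path)
    raw = p.read_bytes()  # binary read: no charset decoding, no line splitting
    return {
        "filename": p.name,
        # "data" mirrors the field name set by the ruby filter in the question
        "data": base64.b64encode(raw).decode("ascii"),
    }


def build_index_request(path, index="resumedata", doc_type="resumes",
                        pipeline="attachment"):
    """Return (url_path, json_body) for a POST that routes through the pipeline.

    Assumption: a typed URL of the form /index/type, as in older
    Elasticsearch releases; newer releases use a different path.
    """
    doc = build_attachment_doc(path)
    url_path = f"/{index}/{doc_type}?pipeline={pipeline}"
    return url_path, json.dumps(doc)
```

The returned path and body could then be sent with any HTTP client, one request per file. FSCrawler takes care of this kind of directory walking and extraction for you, which is why it is usually the simpler route.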
