I want to use Logstash to read PDF, DOCX, and similar files.
I think the ruby filter can solve the problem,
so I tried the following:
1) Create a function pdf_to_text(pdf_filename):
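require 'docsplit'
require 'tmpdir' # Dir.tmpdir is defined by the tmpdir standard library

def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)
  # Read the extracted text, then remove the temporary file
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)
  extracted_text
end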
2) Call the function:
pdf_to_text('C:/Ruby27-x64/sample2.pdf')
But the following error appears:
Traceback (most recent call last):
12: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
11: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
10: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
9: from (irb):10
8: from (irb):3:in `pdf_to_text'
7: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit.rb:52:in `extract_text'
6: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `extract'
5: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `each'
4: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:38:in `block in extract'
3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:108:in `extract_full'
1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:101:in `run'
Traceback (most recent call last):
17: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
16: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
15: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
14: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:400:in `start'
13: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:471:in `run'
12: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:471:in `catch'
11: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:472:in `block in run'
10: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:537:in `eval_input'
9: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:150:in `each_top_level_statement'
8: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:150:in `catch'
7: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:151:in `block in each_top_level_statement'
6: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:151:in `loop'
5: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:166:in `block (2 levels) in each_top_level_statement'
4: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:538:in `block in eval_input'
3: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:704:in `signal_status'
2: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:559:in `block (2 levels) in eval_input'
1: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:607:in `handle_exception'
C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:607:in `split': invalid byte sequence in UTF-8 (ArgumentError)
By the way, my system is Windows 10 and the Ruby version is 2.7.2p137.
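
One possible reading of the two tracebacks: the first exception is raised inside Docsplit's `run`, and the final ArgumentError comes from irb's own `handle_exception` while it splits that exception's message for display. If so, the UTF-8 error may be hiding the real failure rather than being the failure itself. The sketch below is only an illustration under that assumption: the byte values in `bad_message` are made up, and `printable_message` is a hypothetical helper, not part of Docsplit or irb. It shows that `String#split` raises on a string with invalid UTF-8 bytes, and how `String#scrub` can make such a message printable.

```ruby
# Illustrative message containing bytes that are not valid UTF-8
# (made-up bytes standing in for console output in another code page).
bad_message = "command failed: \xB2\xBB\xCA\xC7".dup.force_encoding('UTF-8')

# Splitting a string with an invalid byte sequence raises ArgumentError,
# which matches the last line of the traceback.
begin
  bad_message.split(/\n/)
rescue ArgumentError => e
  puts "split failed: #{e.message}"
end

# Hypothetical helper: return the exception message with invalid bytes
# replaced by '?', so it can be printed or split safely.
def printable_message(error)
  error.message.dup.force_encoding('UTF-8').scrub('?')
end

# Usage sketch: rescue the original error and print a cleaned-up message.
begin
  raise StandardError, bad_message
rescue StandardError => e
  puts printable_message(e)
end
```

Wrapping the `pdf_to_text` call in the same kind of `begin`/`rescue` and printing `printable_message(e)` should show at least the readable part of whatever Docsplit reported, which may help narrow down the actual cause.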