Reading PDF files in Logstash with the ruby filter

I want to use Logstash to read PDF, DOCX, and similar files.
I think the ruby filter can solve this problem,
so I tried the following:
1) Create a function pdf_to_text(pdf_filename):
require 'docsplit'
require 'tmpdir'  # provides Dir.tmpdir

def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)  # remove the temporary text file
  extracted_text
end
2) Call the function:
pdf_to_text('C:/Ruby27-x64/sample2.pdf')
But the following error appears:
Traceback (most recent call last):
12: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
11: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
10: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
9: from (irb):10
8: from (irb):3:in `pdf_to_text'
7: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit.rb:52:in `extract_text'
6: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `extract'
5: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `each'
4: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:38:in `block in extract'
3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:108:in `extract_full'
1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:101:in `run'
Traceback (most recent call last):
17: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
16: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
15: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
14: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:400:in `start'
13: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:471:in `run'
12: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:471:in `catch'
11: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:472:in `block in run'
10: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:537:in `eval_input'
9: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:150:in `each_top_level_statement'
8: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:150:in `catch'
7: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:151:in `block in each_top_level_statement'
6: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:151:in `loop'
5: from C:/Ruby27-x64/lib/ruby/2.7.0/irb/ruby-lex.rb:166:in `block (2 levels) in each_top_level_statement'
4: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:538:in `block in eval_input'
3: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:704:in `signal_status'
2: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:559:in `block (2 levels) in eval_input'
1: from C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:607:in `handle_exception'
C:/Ruby27-x64/lib/ruby/2.7.0/irb.rb:607:in `split': invalid byte sequence in UTF-8 (ArgumentError)

By the way, my system is Windows 10 and the Ruby version is 2.7.2p137.

Hi,

Welcome to this forum!

According to your trace, this error comes not from Logstash but from your local Ruby installation and its interactive shell (irb), right? Did you install docsplit using gem install docsplit?
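
The last frame of the trace (irb.rb:607, `split` inside `handle_exception`) suggests that irb itself failed while formatting the original exception, possibly because its message contained bytes that are not valid UTF-8. To see the underlying error, one option is to rescue it yourself and clean the message before printing. The following is only a sketch under that assumption; `printable_message` is a hypothetical helper name, not part of docsplit or irb.

```ruby
# Hypothetical helper: return a copy of an error message that is safe to
# print, replacing any bytes that are not valid UTF-8 with '?'.
def printable_message(message)
  text = message.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : text.scrub('?')
end

# Made-up message containing a byte (0xE9) that is invalid in UTF-8,
# standing in for text emitted in a Windows code page.
raw = "command failed: \xE9chec".b
puts printable_message(raw)
```

In irb you could then wrap the call, for example `begin; pdf_to_text(path); rescue => e; puts printable_message(e.message); end`, so that the real failure becomes readable.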

Also, please be aware that Logstash ships with its own bundled Ruby runtime (JRuby), which may be a different version from your local Ruby and has its own gem store.
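
A quick way to compare the two environments is to print the engine, version, and gem search paths in each one. This is a minimal sketch using only standard Ruby constants and `Gem.path`; `ruby_runtime_summary` is a name made up for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: collect basic facts about the Ruby that is running.
def ruby_runtime_summary
  {
    engine: RUBY_ENGINE,      # e.g. "ruby" for MRI, "jruby" for JRuby
    version: RUBY_VERSION,
    platform: RUBY_PLATFORM,
    gem_paths: Gem.path       # directories searched for installed gems
  }
end

ruby_runtime_summary.each { |key, value| puts "#{key}: #{value}" }
```

If the output differs between your local irb and the Ruby that Logstash runs, a gem installed for one will not necessarily be visible to the other.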

Best regards
Wolfram

Yes, I tried it on my local Ruby installation, in the interactive shell.
