I am trying to read BLOB data stored in a MySQL table using the JDBC connector in Logstash, and I intend to index the data in Elasticsearch. I have been able to read a PDF file; however, when I read a DOCX file, I get an error. I am using the following Ruby code in the Logstash filter:
file_to_read = StringIO.new(event.get('resume'))
# THIS STATEMENT GIVES AN ERROR: STRING CONTAINS A NULL BYTE
doc = Docx::Document.open(file_to_read)
data = []
# Retrieve and display paragraphs
doc.paragraphs.each do |p|
  # SINCE THE PARAGRAPHS ARE IN AN ARRAY, THE OUTPUT IS SENT TO AN ARRAY
  data << p.text
  # event.set('doc_content', data.to_s)
end
If I write the binary string returned by event.get('resume') to a file and then read it using doc = Docx::Document.open('FILE_TO_READ'), there is no error.
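For context, here is a minimal sketch of how this error message can arise. It assumes the binary data ends up being treated as a file path somewhere; that is a guess, not something confirmed for the Docx gem. The helper names `path_error_for` and `first_bytes_via_io` and the sample bytes are hypothetical and use only the Ruby standard library.

```ruby
require 'stringio'

# Hypothetical helper: tries to use the given data as a file path and
# returns the resulting error message, or nil if no ArgumentError is raised.
def path_error_for(data)
  File.open(data) { |f| f.read }
  nil
rescue ArgumentError => e
  e.message
end

# Hypothetical helper: wraps the same bytes in a StringIO and reads the
# first n bytes through the IO interface instead of the file system.
def first_bytes_via_io(data, n)
  StringIO.new(data).read(n)
end

# Stand-in for BLOB content: a .docx file is a ZIP archive, so it starts
# with 'PK' and contains NUL bytes.
sample = "PK\x03\x04\x00\x00".b

puts path_error_for(sample).inspect
puts first_bytes_via_io(sample, 2).inspect
```

The same bytes that fail as a path are readable through StringIO, which is consistent with the file-based approach working.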
QUESTION 1: Is it possible to read the binary string directly into the Docx gem, given that the documentation says it can also read from a buffer?
QUESTION 2: If I write the files stored as BLOBs to disk and then read them, I am not able to read all the files using the code below; only the first DOCX file, stored in id=1, is picked up, and its content is duplicated in id=2.
jdbc {
  jdbc_driver_class => "com.mysql.jdbc.Driver"
  jdbc_connection_string => "jdbc:mysql://localhost:3306/mysqljdbc"
  jdbc_user => "root"
  jdbc_password => "root"
  schedule => "*/1 * * * *"
  statement => "SELECT * FROM mysqljdbc.candidates WHERE id IN (1, 2)"
}
filter {
  ruby {
    code => "
      require 'pdf-reader'
      require 'json'
      # added to use io object
      require 'stringio'
      require 'docx'
      format = event.get('type')
      # Convert BINARY STRINGS to IO object
      # StringIO allows strings to behave like IOs. This is useful when we want to pass strings into systems that consume streams. This is common in tests where we might inject a StringIO instead of reading an actual file from disk.
      if format != nil
        file_to_read = StringIO.new(event.get('FIELD'))
      end
      File.binwrite('new.docx', event.get('FIELD'))
      doc = Docx::Document.open('PATH/new.docx')
      data = []
      # Retrieve and display paragraphs
      doc.paragraphs.each do |p|
        # SINCE THE PARAGRAPHS ARE IN AN ARRAY, THE OUTPUT IS SENT TO AN ARRAY
        data << p.text
        # event.set('doc_content', data.to_s)
      end
      #-----------FOR OPENING TABLE CONTENT
      if doc.tables[0] != nil
        # Iterate through tables
        doc.tables.each do |table|
          table.rows.each do |row| # Row-based iteration
            row.cells.each do |cell|
              data << cell.text
            end
          end
        end
      end
      event.set('doc_content', data.to_s)
    "
  }
}
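One possible direction for the file-based approach is sketched below. It assumes the duplication comes from every event sharing the single fixed file name new.docx, or from File.binwrite and Docx::Document.open pointing at different paths; neither cause has been confirmed. The helper name `with_blob_file` is hypothetical, and the sketch uses only the Ruby standard library.

```ruby
require 'tempfile'

# Hypothetical helper: writes binary data to a uniquely named temporary
# file, yields the file's path to the block, and deletes the file afterwards.
# Returns whatever the block returns.
def with_blob_file(blob, extension = '.docx')
  file = Tempfile.new(['resume', extension])
  file.binmode      # avoid any newline or encoding translation
  file.write(blob)
  file.flush        # make sure the bytes are on disk before the block reads them
  yield file.path
ensure
  file.close! if file # close and unlink the temporary file
end

# Usage with stand-in bytes instead of a real BLOB column:
size = with_blob_file("PK\x03\x04\x00".b) { |path| File.size(path) }
puts size
```

Inside the filter, the block could call Docx::Document.open(path) on the yielded path and collect doc.paragraphs as in the code above, so that each event reads back exactly the file it wrote.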
I would be grateful for some pointers in the right direction. Thanks in advance.