Logstash is not showing base64-encoded data for PDFs extracted from URLs

Disha_Bodade · April 20, 2023, 10:42am

Hi Team,
I am using the Logstash http filter to fetch a PDF from a URL and extract its content.
The http filter downloads the PDF and stores its content in the `target_body` field, but the content is garbled and is not base64 encoded. How can I handle this encoding in Logstash so that I can use an ingest pipeline in Elasticsearch to index the PDFs?

Below is the example filter:

        http {
                url => "%{uri}"
                cacert => "/etc/logstash/conf.d/read-pdfs/ca.pem"
                target_body => "content_body"
        }

The content body contains data in the format below, mixed with some English text:

��������뿟˃O��t��̦\u0007W��\u0017x������\u001f\u001eFǧ'�\u001f�{I��\u0005�Q\u0012��\u0015<���{��.���\u001d_��\u001d��\"�q}��ǀ(�XT�c\\F��q.��/@��U\u0016=�@��\u0003�ʫ_?���:���G�?��As���[Ϸ\"�4Y�\u0012�\u001cTLp\u0012Hd1�/�(cEt�Ҝ\u0019?�Et:������I\u0014\u001d\\=�����'�Qrp1�>D�r:��x��>�G\u001fi��$%}�\u001aܽw��v\u0013�gq�:%R\"�DgӻmG�ϊI\u001e���j��`��8�0�FЬ\u0014p�����> �,\u0003M~���C�W��H\f��p�8\f�ʾA�8\f��2�0

How can I handle this encoding and decoding, either in Logstash or in Elasticsearch?

Thanks,
Disha

Rios · April 20, 2023, 11:18am

The default encoding is UTF-8. You can set a codec with a charset:

        http {
                url => "%{uri}"
                cacert => "/etc/logstash/conf.d/read-pdfs/ca.pem"
                target_body => "content_body"
                codec => plain { charset => "ISO8859-1" }
        }

The list of supported character sets is here

Disha_Bodade · April 20, 2023, 12:35pm

@Rios ,
The http filter does not have a codec setting. I tried adding it to the Logstash output plugin instead, but I still get the same result.

output {
        elasticsearch {
                hosts => ["https://esost1:9200"]
                index => "my_index"
                codec => plain { charset => "ISO8859-1" }
                ssl => true
                ssl_certificate_verification => false
                # cacert => "/etc/logstash/certs/ca.pem"
        }
}

Thanks,
Disha

Disha_Bodade · April 28, 2023, 4:58am

Is there any way I can create a base64-formatted string in Logstash? That way I could convert the extracted contents from UTF-8 to base64 and then use the attachment processor.

Badger · April 28, 2023, 5:05am

I would lean towards a ruby filter. This is about decoding base64, but should give some clues about encoding.
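
As a rough illustration of the encoding direction (a minimal sketch, not taken from the linked thread): Ruby's standard `base64` library provides `Base64.strict_encode64`. The helper name `encode_body` and the sample bytes below are made up for illustration.

```ruby
require "base64"

# Hypothetical helper (name is illustrative): turn a raw body into Base64 text.
def encode_body(raw)
  # Reinterpret the string as raw bytes so Ruby does not transcode anything.
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  # strict_encode64 emits a single line with no embedded newlines.
  Base64.strict_encode64(bytes)
end

# Made-up sample: a PDF header followed by a few non-ASCII bytes.
pdf_bytes = "%PDF-1.4\n\xE2\xE3\xCF\xD3".b
encoded = encode_body(pdf_bytes)

# Decoding returns the original bytes, so the encoding itself is lossless.
puts Base64.strict_decode64(encoded) == pdf_bytes
```

Inside a Logstash ruby filter, the same call could presumably be applied to the value of `event.get("content_body")` and written back with `event.set`; the destination field name would be your choice. This only helps if the http filter hands over the original bytes. If the body has already been decoded as text (the replacement characters in the sample above suggest that may be happening), the lost bytes cannot be recovered by encoding afterwards.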

Disha_Bodade · April 28, 2023, 6:58am

@Badger,
The input source is a database with a URL field. I am using the Logstash http filter to fetch data from those URLs, which point to PDF, Word, and text files.

The files are large, up to 6 MB each.
Is it possible to encode the whole content extracted by the http filter into base64 and then use the attachment processor on that encoded data?

Otherwise, please let me know of any other available options.
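
On the size question: Base64 produces 4 output characters for every 3 input bytes (rounded up), so a 6 MB file grows to roughly 8 MB of text. A small sketch of that arithmetic, assuming 6 MB means 6 * 1024 * 1024 bytes; the helper name `base64_length` is hypothetical:

```ruby
require "base64"

# Hypothetical helper: length of padded Base64 text for a given byte count.
def base64_length(byte_count)
  # Every group of up to 3 input bytes becomes 4 output characters.
  ((byte_count + 2) / 3) * 4
end

six_mb = 6 * 1024 * 1024
puts base64_length(six_mb)

# Cross-check the formula against the real encoder on a small input.
sample = "a" * 10
puts Base64.strict_encode64(sample).length == base64_length(sample.bytesize)
```

The estimate covers only the encoded field itself; the rest of the event adds to the size of the request sent to Elasticsearch.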
