Logstash is not showing base64-encoded data for PDFs extracted from URLs

Hi Team,
I am using the Logstash http filter to fetch a PDF from a URL and extract it.
The http filter downloads the PDF and puts its content in the target field, but the content is garbled and not base64-encoded. How can I handle this encoding in Logstash, so that I can use an ingest pipeline in Elasticsearch to index the PDFs?

Below is the example filter:

        http {
          url => "%{uri}"
          cacert => "/etc/logstash/conf.d/read-pdfs/ca.pem"
          target_body => "content_body"
        }

The content_body field has data in the format below, mixed in with English text:

��������뿟˃O��t��̦\u0007W��\u0017x������\u001f\u001eFǧ'�\u001f�{I��\u0005�Q\u0012��\u0015<���{��.���\u001d_��\u001d��\"�q}��ǀ(�XT�c\\F��q.��/@��U\u0016=�@��\u0003�ʫ_?���:���G�?��As���[Ϸ\"�4Y�\u0012�\u001cTLp\u0012Hd1�/�(cEt�Ҝ\u0019?�Et:������I\u0014\u001d\\=�����'�Qrp1�>D�r:��x��>�G\u001fi��$%}�\u001aܽw��v\u0013�gq�:%R\"�DgӻmG�ϊI\u001e���j��`��8�0�FЬ\u0014p�����> �,\u0003M~���C�W��H\f��p�8\f�ʾA�8\f��2�0

How can I handle this encoding and decoding, either in Logstash or in Elasticsearch?

Thanks,
Disha

The default encoding is UTF-8. You can use a codec with a charset:

        http {
          url => "%{uri}"
          cacert => "/etc/logstash/conf.d/read-pdfs/ca.pem"
          target_body => "content_body"
          codec => plain { charset => "ISO8859-1" }
        }

The list of supported character sets is here.

@Rios ,
The http filter does not have a codec setting. I tried adding it to the Logstash elasticsearch output plugin instead, but got the same result.

        output {
          elasticsearch {
            hosts => ["https://esost1:9200"]
            index => "my_index"
            codec => plain { charset => "ISO8859-1" }
            ssl => true
            ssl_certificate_verification => false
            # cacert => "/etc/logstash/certs/ca.pem"
          }
        }

Thanks,
Disha

Is there any way I can create a base64-encoded string in Logstash? That way I could convert the extracted content to base64 and then use the attachment processor.
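Something like this ingest pipeline is what I have in mind (a sketch only; the pipeline name `pdf_attachment` and the field name `content_b64` are placeholders I made up):

```
PUT _ingest/pipeline/pdf_attachment
{
  "processors": [
    {
      "attachment": {
        "field": "content_b64",
        "target_field": "attachment"
      }
    }
  ]
}
```

The attachment processor expects the field to contain base64-encoded binary, which is why I am asking about producing base64 in Logstash first.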

I would lean towards a ruby filter. This is about decoding base64, but it should give some clues about encoding.
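A minimal sketch of the encoding step, assuming the raw bytes land in a field called content_body (both field names here are placeholders, not from your config):

```ruby
require "base64"

# Inside a ruby filter, the same operation would use the Logstash event API:
#   event.set("content_b64", Base64.strict_encode64(event.get("content_body")))
# Standalone equivalent on some sample bytes:
raw = "\x25PDF-1.4\x0A\xBF\xF7".b   # hypothetical binary PDF bytes
encoded = Base64.strict_encode64(raw)
puts encoded   # "JVBERi0xLjQKv/c="
```

In a pipeline that would look roughly like this (untested sketch):

        filter {
          ruby {
            code => 'event.set("content_b64", Base64.strict_encode64(event.get("content_body")))'
          }
        }

Base64.strict_encode64 produces output without line breaks, which is what the attachment processor wants.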

@Badger,
The input source is a database with a URL field. I am using the Logstash http filter to fetch data from those URLs; the URLs point to PDF, Word, and text files.

The files are large, up to 6 MB.
Is it possible to encode the whole content extracted by the http filter into base64 and then use the attachment processor on that encoded data?

Or let me know of any other available option.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.