Elasticsearch-mapper-attachments error logs


#1

Hi,

I am testing attachments with elasticsearch-mapper-attachments, using a simple job that imports documents with curl.

I encountered some errors and warnings:

[2015-09-29 23:49:53,013][ERROR][org.apache.pdfbox.filter.FlateFilter] FlateFilter: stop reading corrupt stream due to a DataFormatException
[2015-09-29 23:50:15,416][WARN ][org.apache.fontbox.util.FontManager] Font not found: Tahoma
[2015-09-29 23:53:47,390][WARN ][org.apache.pdfbox.pdfparser.BaseParser] Specified stream length 127 is wrong. Fall back to reading stream until 'endstream'.

How can I get the name of the file that caused the error into elasticsearch.log?

Thanks


(David Pilato) #2

Elasticsearch mapper attachments only receives BASE64 binary content here. It has no idea whether that content comes from a file (and so has a filename), from a blob in your database, from a URL, or from anywhere else.

So I'm afraid there is no way to do that.

You should consider doing that on the client. When you send a file, you know its filename, and you can probably read the response from elasticsearch to find out that something went wrong with file X.
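
To illustrate the client-side idea, here is a minimal sketch in Python. It is only an assumption about how such a client could look, not something from the plugin itself: the field names "file" and "filename" and the `send` callable (whatever actually posts the JSON body to elasticsearch and returns the HTTP status and response text) are hypothetical. It also only helps when elasticsearch actually reports the failure in its response.

```python
import base64
import json
import logging
from typing import Callable, Tuple

log = logging.getLogger("attachment-import")


def build_document(filename: str, content: bytes) -> dict:
    """Build the JSON body for one file.

    "file" and "filename" are hypothetical field names; use whatever
    your own mapping defines for the attachment field.
    """
    return {
        "filename": filename,
        "file": base64.b64encode(content).decode("ascii"),
    }


def index_file(filename: str, content: bytes,
               send: Callable[[str], Tuple[int, str]]) -> bool:
    """Send one document and log the outcome together with its filename.

    `send` is a hypothetical callable that posts the JSON body to
    elasticsearch and returns (http_status, response_text).
    """
    status, text = send(json.dumps(build_document(filename, content)))
    try:
        body = json.loads(text)
    except ValueError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    # Treat a non-2xx status or an "error" key in the body as a failure.
    ok = 200 <= status < 300 and "error" not in body
    if ok:
        log.info("indexed %s as _id=%s", filename, body.get("_id"))
    else:
        log.error("failed to index %s: HTTP %s %s", filename, status, text)
    return ok
```

Because the filename is known at the moment the request is sent, the client log can pair each failure with its file even though elasticsearch.log cannot.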


#3

Hi Dadoonet,

Thanks for your reply. Of course, I base64-encoded the files with an ETL:
88,262 documents (44 GB) were successfully parsed and indexed out of 88,933 files.

When I check my client logs, the curl response is still "created":true:
{"_index":"repo","_type":"attachment","_id":"AVAZBfl3Iw8zpJxWbnIQ","_version":1,"created":true}

Thanks.


(David Pilato) #4

Indeed. It's because we ignore errors by default.
You can change that with this.


#5

My bad! I missed that part. Thanks. Reimporting docs...


#6

Hi,

I created an index with index.mapping.attachment.ignore_errors set to false.

{
  "test2" : {
    "settings" : {
      "index" : {
        "index" : {
          "mapping" : {
            "attachment" : {
              "ignore_errors" : "false",
              "indexed_chars" : "-1"
            }
          }
        },
        "creation_date" : "1443609269356",
        "number_of_shards" : "1",
        "number_of_replicas" : "0",
        "version" : {
          "created" : "1070199"
        },
        "uuid" : "8GM9Ud9yQg6MJJbdSZuTAQ"
      }
    }
  }
}
I encountered some errors, but all responses from elasticsearch are still "created":true.

[2015-09-30 23:31:00,763][ERROR][org.apache.pdfbox.filter.FlateFilter] FlateFilter: stop reading corrupt stream due to a DataFormatException
[2015-09-30 23:45:34,565][ERROR][org.apache.pdfbox.filter.FlateFilter] FlateFilter: stop reading corrupt stream due to a DataFormatException


(David Pilato) #7

Can you open an issue, ideally with a link to a file we could reuse in a test?

