Indexing attachments (11,000): parsers keep crashing Elasticsearch

Hi,

Here is today's full log:
https://gist.github.com/anonymous/ef0cbf956714cf9b138f

This log also contains other kinds of errors I made, such as typos in curl
commands. They are not relevant to the indexing problem.

Most files are smaller than 2 MB. I had a problem with an 80 MB .rtf file, but
that file was corrupted.

I'm not able to attach documents:

  • The documents are highly confidential.
  • Elasticsearch (the parsers) does not produce useful logs: no filename,
    document reference or any other useful information, so I can't find which
    file crashed the server. Also, the process does not run on a local server.
  • I did not handle errors correctly, so I can't determine it now. And I
    can't reindex all the files now :frowning:
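
One way to make the offending file identifiable on a later run is to write the file name to a journal before and after each indexing attempt. This is only a minimal sketch in Python (the script discussed here uses elasticsearch-php): `index_with_journal`, `suspects`, the journal format and the `index_one` callable are all hypothetical names, and `index_one` stands in for whatever call actually sends one file to Elasticsearch.

```python
import json
import os


def index_with_journal(paths, index_one, journal_path):
    """Record each file name before and after trying to index it.

    index_one is a hypothetical callable that sends one file to Elasticsearch
    and raises an exception when the request fails.
    """
    with open(journal_path, "a", encoding="utf-8") as journal:
        for path in paths:
            # Write the "start" entry and force it to disk first, so the
            # file name survives even if everything dies during the request.
            journal.write(json.dumps({"file": path, "state": "start"}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
            try:
                index_one(path)
                state = "ok"
            except Exception as exc:  # e.g. the server stopped responding
                state = "error: %s" % exc
            journal.write(json.dumps({"file": path, "state": state}) + "\n")
            journal.flush()


def suspects(journal_path):
    """Return the files whose last recorded state is not "ok"."""
    last_state = {}
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            entry = json.loads(line)
            last_state[entry["file"]] = entry["state"]
    return [name for name, state in last_state.items() if state != "ok"]
```

After a crash, `suspects("journal.jsonl")` would list the files that failed or were still in flight, which narrows down the candidates without reindexing everything.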

If I can find one, I'll post it.

But, as I said, I understand that a parser can crash because a file is
too big or corrupted, but Elasticsearch itself should not crash as well, should it?

Thank you.

On Wednesday, July 9, 2014 at 15:45:04 UTC+11, David Pilato wrote:

Could you gist the full logs?
Do you have some "big" attachments?
Could you copy some failing attachments to bintray or any other service
and paste the link here?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On July 9, 2014, at 05:42, aurelien bax <pico...@gmail.com> wrote:

Hi,

I'm trying to index 11,000 documents (PDF, Word...).

My configuration:

Elasticsearch 1.2.1, elasticsearch-river-jdbc-1.2.1.1-plugin.zip,
elasticsearch-mapper-attachments/2.0.0 on a Debian server.

I'm using elasticsearch-php. I don't think that posting my code is useful.

I have to use small batches (from 50 to 200 documents) because the parsers
raise exceptions and Elasticsearch stops.

I then need to restart the server and rerun the previous batch.

I'm not reindexing all the docs: before starting a batch, the script
checks whether each doc is already indexed and, if so, skips it.
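
The skip-then-batch logic described above can be sketched as follows. This is an illustrative Python sketch rather than the actual PHP script: `pending_batches` and the `is_indexed` callable are hypothetical names, and `is_indexed` stands in for whatever existence check the client library provides.

```python
def pending_batches(doc_ids, is_indexed, batch_size=100):
    """Yield lists of at most batch_size ids that are not indexed yet.

    is_indexed is a hypothetical callable returning True for ids that are
    already present in the index.
    """
    batch = []
    for doc_id in doc_ids:
        if is_indexed(doc_id):
            continue  # already indexed on an earlier run, skip it
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly shorter, batch
```

Because already-indexed ids are filtered out before batching, rerunning after a restart only resends the documents from the batch that was interrupted and the ones after it.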

When I rerun the previous batch, it nearly always works without crashing
Elasticsearch.

So, I understand that the parsers cannot handle every file (for many
reasons), but why does that crash Elasticsearch?

Why are the exceptions not handled instead of crashing everything?

Is there a way to handle exceptions before Elasticsearch crashes?

Sample log errors:

[WARN ][org.apache.tika.parser.microsoft.AbstractPOIFSExtractor] Ignoring unexpected exception while parsing summary entry ^ESummaryInformation
java.io.UnsupportedEncodingException: Codepage number may not be 0

[WARN ][org.apache.pdfbox.pdfparser.XrefTrailerResolver] Did not found XRef object at specified startxref position 730864
[2014-07-09 10:51:40,078][WARN ][org.apache.pdfbox.pdfparser.BaseParser] Specified stream length 587 is wrong. Fall back to reading stream until 'endstream'.

[org.apache.pdfbox.pdfparser.BaseParser] Specified stream length 952 is wrong. Fall back to reading stream until 'endstream'.
[2014-07-09 11:52:08,044][ERROR][org.apache.pdfbox.pdmodel.font.PDSimpleFont] Can't determine the width of the space character using 250 as default

Thank you

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/5e682100-4bce-48c7-b43d-78ccd85b5750%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.