Indexing attachments (11,000): parsers keep crashing Elasticsearch


(aurelien bax) #1

Hi,

I'm trying to index 11,000 documents (PDF, Word...).

My configuration:

Elasticsearch 1.2.1, elasticsearch-river-jdbc-1.2.1.1-plugin.zip,
elasticsearch-mapper-attachments/2.0.0 on a Debian server.

I'm using elasticsearch-php. I don't think posting my code would be useful.

I have to index in small batches (50 to 200 documents) because the parsers
raise exceptions and Elasticsearch stops.

I then need to restart the server and rerun the previous batch.

I'm not reindexing all the documents: before starting a batch, the script
checks whether each document is already indexed and skips it if so.

When I rerun the previous batch, it nearly always works without crashing
Elasticsearch.
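
The skip-and-resume batching described above could be sketched as follows. This is only an illustration in Python, not the poster's elasticsearch-php script (which was not shared). The names `pending_batches` and `is_indexed` are hypothetical; the callback stands in for whatever "is this document already indexed?" check the real script performs.

```python
from typing import Callable, Iterable, Iterator, List


def pending_batches(
    doc_ids: Iterable[str],
    is_indexed: Callable[[str], bool],
    batch_size: int = 50,
) -> Iterator[List[str]]:
    """Yield batches of document ids that still need indexing.

    `is_indexed` is a hypothetical callback: it should return True when
    the document is already in the index, so a rerun after a crash only
    sends what is missing.
    """
    batch: List[str] = []
    for doc_id in doc_ids:
        if is_indexed(doc_id):
            continue  # already indexed on a previous run, skip it
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

With this shape, rerunning after a restart naturally resumes where the previous run stopped, because everything indexed before the crash is filtered out.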

So I understand that the parsers cannot handle every file (for many
reasons), but why does that crash Elasticsearch?

Why are the exceptions not handled instead of crashing everything?

Is there a way to handle the exceptions before Elasticsearch crashes?

Sample log errors:

[WARN ][org.apache.tika.parser.microsoft.AbstractPOIFSExtractor] Ignoring unexpected exception while parsing summary entry ^ESummaryInformation
java.io.UnsupportedEncodingException: Codepage number may not be 0

[WARN ][org.apache.pdfbox.pdfparser.XrefTrailerResolver] Did not found XRef object at specified startxref position 730864
[2014-07-09 10:51:40,078][WARN ][org.apache.pdfbox.pdfparser.BaseParser] Specified stream length 587 is wrong. Fall back to reading stream until 'endstream'.

[org.apache.pdfbox.pdfparser.BaseParser] Specified stream length 952 is wrong. Fall back to reading stream until 'endstream'.
[2014-07-09 11:52:08,044][ERROR][org.apache.pdfbox.pdmodel.font.PDSimpleFont] Can't determine the width of the space character using 250 as default

Thank you

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5e682100-4bce-48c7-b43d-78ccd85b5750%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

Could you gist the full logs?
Do you have some "big" attachments?
Could you copy some failing attachments to bintray or any other service and paste the link here?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(aurelien bax) #3

Hi,

Here is today's full log:
https://gist.github.com/anonymous/ef0cbf956714cf9b138f

This log also contains other kinds of errors I made, such as typos in curl
commands, which are not relevant to the indexing problem.

Most files are smaller than 2 MB. I had a problem with an 80 MB .rtf file,
but that file was corrupted.

I'm not able to share the documents:

  • They are highly confidential.
  • Elasticsearch (the parsers) does not produce useful logs: no filename,
    document reference or any other useful information, so I can't find which
    file crashed the server. And the process does not run on a local server.
  • I did not handle errors correctly, so now I can't determine which file it
    was. And I can't reindex all the files now :frowning:

If I can find one, I'll post it.

But, as I said, I understand that a parser can crash because a file is
too big or corrupted, but Elasticsearch should not crash too, should it?

Thank you.
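
Since the server logs carry no filename, one way to find the offending file is to record it on the client side before each request. The sketch below is an assumption-laden illustration in Python, not the poster's script: it indexes one document at a time, and `index_one` is a hypothetical callback standing in for the real indexing call. The journal is flushed to disk before the request is sent, so the entry survives even if the server (or the client) dies mid-request.

```python
import os
from typing import Callable, Iterable, Optional


def _append(journal_path: str, line: str) -> None:
    """Append one line to the journal and force it to disk."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(line + "\n")
        journal.flush()
        os.fsync(journal.fileno())


def index_with_journal(
    paths: Iterable[str],
    index_one: Callable[[str], None],
    journal_path: str,
) -> None:
    """Index files one by one, journaling START/DONE around each call.

    `index_one` is hypothetical: it should send a single file to the
    server and raise if the request fails.
    """
    for path in paths:
        _append(journal_path, "START " + path)
        index_one(path)
        _append(journal_path, "DONE " + path)


def last_unfinished(journal_path: str) -> Optional[str]:
    """Return the file that was started but never finished, if any."""
    pending: Optional[str] = None
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            state, _, path = line.rstrip("\n").partition(" ")
            if state == "START":
                pending = path
            elif state == "DONE" and path == pending:
                pending = None
    return pending
```

After a crash, `last_unfinished` points at the document that was in flight, which can then be set aside and tested on its own. Indexing one document per request is slower than batching, so this is better suited to a diagnostic run than to normal operation.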



(David Pilato) #4

Could you open an issue in the mapper attachments project and add all the details?

Can you see any dump file in the Elasticsearch directory?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
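
As a rough aid for the dump-file question, the sketch below searches a directory tree for the files a HotSpot JVM typically leaves behind: fatal error logs named `hs_err_pid<pid>.log` and heap dumps ending in `.hprof` (the latter are only written on OutOfMemoryError when the JVM is started with the corresponding flag). The function name `find_jvm_dumps` is hypothetical, and where the files actually land depends on the JVM's working directory and options, so several directories may need checking.

```python
from pathlib import Path
from typing import List


def find_jvm_dumps(directory: str) -> List[Path]:
    """Recursively list HotSpot crash logs and heap dumps under `directory`."""
    patterns = ("hs_err_pid*.log", "*.hprof")
    root = Path(directory)
    found: List[Path] = []
    for pattern in patterns:
        found.extend(root.rglob(pattern))
    return sorted(found)
```

An `hs_err_pid` file would point to a crash inside the JVM or native code, while a heap dump would suggest the node ran out of memory while parsing.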



(system) #5