ES + Attachment --> indexed documents incomplete


(bdonnovan) #1

Hey everyone,

I am a brand new user of Elasticsearch and I like it very much so
far. I am using ES version 0.18.7 with the attachment-mapper plugin.
It is properly installed, and I see it getting picked up on engine
startup. I am facing a weird problem when indexing text/plain or PDF
content. When I provide the Base64-encoded content and start indexing,
everything finishes with no errors thrown. But when I check or query
the index, I see that only part of the document has been indexed;
everything after a certain point is missing and cannot be queried. I
have tried multiple PDF and text files (2 to 3 MB each), with the same
effect every time. It seems that no more than 4000 terms get indexed,
and I wonder why. I set logging to debug and see no warnings or errors
at all while indexing. I guess I am doing something wrong, and I hope
someone out there has a clue what it could be. Any help would be much
appreciated.

My code basically looks like this: https://gist.github.com/1830040

And one of the failing test documents would be this pdf one:

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CDcQFjAC&url=http%3A%2F%2Fwww-ivs.cs.uni-magdeburg.de%2Fsw-eng%2Fagruppe%2Fforschung%2Fdiplomarbeiten%2Frene.pdf&ei=EMQ6T-z3K4_asgbAqLTlBg&usg=AFQjCNH948NSAdF_VaO_OUNQtXsCP1WRTQ

Any suggestions on how to get it right?


(ppearcy) #2

Elasticsearch integrates tika for this:
http://tika.apache.org/

I'd recommend downloading Tika separately and running the "tika-app",
a nice little GUI tool you can drag and drop files onto. I'm pretty
sure Elasticsearch is on Tika 1.0.

This will show you what it is doing under the hood and might shed some
light on things.

Best Regards,
Paul


(Shairon Toledo) #3

I suggest changing the readContent method to use
org.elasticsearch.common.Base64.encodeFromFile instead.

--
[ ]'s
Shairon Toledo
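
For anyone who wants to rule out the encoding step, here is a minimal sketch that Base64-encodes a file and checks that it decodes back to the same bytes. It uses the JDK's java.util.Base64 as a stand-in for org.elasticsearch.common.Base64.encodeFromFile, which is an assumption made so the example runs without ES on the classpath; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of Elasticsearch or the original gist.
final class AttachmentEncoding {

    /** Base64-encodes raw bytes into a single line (no line breaks). */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Reads the whole file into memory and Base64-encodes it. */
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }

    /** True if decoding the encoded text gives back exactly the original bytes. */
    static boolean roundTrips(byte[] original, String encoded) {
        return Arrays.equals(original, Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello attachment".getBytes(StandardCharsets.UTF_8);
        Path tmp = Files.createTempFile("attachment", ".txt");
        try {
            Files.write(tmp, original);
            String encoded = encodeFile(tmp);
            System.out.println("encoded length: " + encoded.length());
            System.out.println("round trip intact: " + roundTrips(original, encoded));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the round trip is intact for a 2 to 3 MB file, the truncation is probably happening after the content reaches the server rather than in the client-side encoding.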


(bdonnovan) #4

Thanks for the hint.

I just tried Tika directly, and all the tested documents are parsed
fully and correctly. I see the whole text to the very end.

But with ES, still no luck. It also doesn't change a thing when I take
the text extracted from the PDF, put it into a plain text file, drop
the index and feed it again with that file; the result is the same.

Could there be any sort of configuration that is messed up? I didn't
change anything, though, so this seems rather unlikely.


(bdonnovan) #5

Thanks for the quick replies!

I tried the Base64.encodeFromFile() utility bundled with ES before,
and once more just now; no luck, unfortunately.


(Henac) #6

I am getting exactly the same behaviour in my app. I will be debugging
tomorrow and will post if we find out what is causing it.


(Henac) #7

I think I have found the issue. The problem is in the
org.elasticsearch.index.mapper.attachment.AttachmentMapper class, where
Tika is called to extract the text from the document:

parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata);

The javadoc for parseToString() states: "To avoid unpredictable excess
memory use, the returned string contains only up to getMaxStringLength()
first characters extracted from the input document. Use the
setMaxStringLength(int) method to adjust this limitation." The default
value for maxStringLength is 100,000.

Javadoc links:
http://tika.apache.org/0.10/api/org/apache/tika/Tika.html#getMaxStringLength()
http://tika.apache.org/0.10/api/org/apache/tika/Tika.html#setMaxStringLength(int)

I am going to submit a patch so that setMaxStringLength(int) is called
before parseToString(), with a value taken from the Elasticsearch
configuration file.

Cheers
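
To make the effect concrete, here is a small stdlib-only sketch that emulates the cap rather than calling Tika itself. The cap() helper and the class name are hypothetical; the 100,000 default is the value quoted above. With the real library the corresponding calls would be setMaxStringLength(int) followed by parseToString(...) on the Tika facade.

```java
// Hypothetical simulation of the string-length cap; it does not use Tika.
final class MaxStringLengthDemo {

    /** Default limit quoted from the Tika javadoc in this thread. */
    static final int TIKA_DEFAULT_LIMIT = 100_000;

    /**
     * Keeps at most maxLength characters of the extracted text.
     * A negative maxLength is treated as "no limit" (an assumption of this sketch).
     */
    static String cap(String extracted, int maxLength) {
        if (maxLength < 0 || extracted.length() <= maxLength) {
            return extracted;
        }
        return extracted.substring(0, maxLength);
    }

    public static void main(String[] args) {
        // Build roughly 2 MB of synthetic "extracted" text.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; sb.length() < 2_000_000; i++) {
            sb.append("term").append(i).append(' ');
        }
        String full = sb.toString();
        String indexed = cap(full, TIKA_DEFAULT_LIMIT);

        System.out.println("extracted chars: " + full.length());
        System.out.println("kept chars:      " + indexed.length());
        System.out.println("dropped chars:   " + (full.length() - indexed.length()));
    }
}
```

This would explain why no error shows up in the logs: the text past the limit is cut off before it is indexed, so anything that appears only later in the document can never match a query.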


(arien) #8

Please reply to this thread once you submit the patch, or post a link
to a gist. Thanks.


(David Pilato) #9

Nice catch. May I suggest that this length property go in the attachment mapping definition?
That way we could define the limit we need for each ES type.

David :wink:
@dadoonet


(Henac) #10

Good suggestion. Hoping to have it up in the next day or so.


(Henac) #11

I have patched the Elasticsearch attachment mapper plugin
(https://github.com/Henac/elasticsearch-mapper-attachments) so that you
can specify how much text is extracted and indexed from each uploaded
document. A pull request is pending.

By default, Tika extracts at most 100,000 characters from the uploaded
file attachment. With the patch, you can specify on upload the maximum
number of characters to extract from the document (specify -1 to remove
the limit).

Example usage:
{
    "my_attachment" : {
        "_content_length" : 500000,
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "content" : "... base64 encoded attachment ..."
    }
}

David Pilato suggested putting this setting in the attachment mapping
definition, but I haven't done this yet. The current implementation,
which supplies the content limit on upload, provides a very granular
approach. BTW, if you don't specify _content_length, it will default
to Tika's default of 100,000. Also, be warned that specifying -1 to
remove the limit may cause memory issues if you start uploading very
large documents.
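
For Java clients, the request body above could be assembled roughly like this. It is only a sketch: the class, the build() and quote() helpers, and the sample file name are hypothetical, java.util.Base64 stands in for whatever encoder you already use, and the field names are the ones from the example usage above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that assembles the attachment JSON shown in this post.
final class AttachmentRequest {

    /** Wraps a value in double quotes, escaping only backslashes and quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * Builds the JSON body for one attachment field.
     * contentLength is the per-upload character limit (-1 to remove the limit).
     */
    static String build(String field, byte[] raw, String contentType,
                        String name, int contentLength) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return "{" + quote(field) + ":{"
                + "\"_content_length\":" + contentLength + ","
                + "\"_content_type\":" + quote(contentType) + ","
                + "\"_name\":" + quote(name) + ","
                + "\"content\":" + quote(encoded)
                + "}}";
    }

    public static void main(String[] args) {
        byte[] raw = "some plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(build("my_attachment", raw, "text/plain", "notes.txt", 500000));
    }
}
```

The quote() helper is deliberately minimal and does not handle control characters; a real client would normally use a JSON library instead of string concatenation.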

