I have patched the mapper-attachments plugin for Elasticsearch (
GitHub - Henac/elasticsearch-mapper-attachments: Mapper Attachments Type plugin for ElasticSearch) so that you can
specify how much text is extracted and indexed from each uploaded
document. The change is awaiting a pull request.
By default, Tika extracts at most 100,000 characters from an uploaded
file attachment. I have modified the plugin so that, on upload, you can
specify the maximum number of characters to extract from the document
(specify -1 to remove the limit).
Example usage:
{
    "my_attachment" : {
        "_content_length" : 500000,
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "content" : "... base64 encoded attachment ..."
    }
}
David Pilato suggested putting this setting in the attachment mapping
definition, but I haven't done that yet. The current implementation,
which supplies the content limit on upload, gives very granular,
per-document control. BTW, if you don't specify _content_length, it
falls back to Tika's default of 100,000. Also, be warned that
specifying -1 to remove the limit may cause memory issues if you start
uploading very large documents.
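For anyone building the upload body in code, here is a minimal sketch in Python of assembling the document shown above. The helper name build_attachment_doc and its arguments are made up for illustration, and the "_content_length" key is only honoured by the patched plugin described in this post. The sketch only builds the JSON; sending it to the index is left out.

```python
import base64
import json


def build_attachment_doc(data, name, content_type,
                         content_length=None, field="my_attachment"):
    """Build the JSON body for one attachment (hypothetical helper).

    content_length is the per-document extraction limit from the post:
    None leaves the key out so the default of 100,000 applies,
    and -1 asks for no limit at all.
    """
    attachment = {
        "_content_type": content_type,
        "_name": name,
        # The attachment field expects the raw file bytes, base64 encoded.
        "content": base64.b64encode(data).decode("ascii"),
    }
    if content_length is not None:
        attachment["_content_length"] = content_length
    return {field: attachment}


if __name__ == "__main__":
    # Placeholder bytes; in practice read the real file with open(path, "rb").
    doc = build_attachment_doc(
        b"%PDF-1.4 placeholder bytes",
        "resource/name/of/my.pdf",
        "application/pdf",
        content_length=500000,
    )
    print(json.dumps(doc, indent=2))
```

Leaving content_length unset keeps the default behaviour, so the limit can be raised only for the documents that actually need it.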
On Wednesday, 15 February 2012 07:32:52 UTC+11, bdonnovan wrote:
Hey everyone,
I am pretty much a brand-new user of Elasticsearch and I like it very
much so far. I am using ES version 0.18.7 with the attachment-mapper
plugin. It is properly installed and I see it getting picked up on
engine startup. I am facing a weird problem when trying to index
text/plain or PDF content. When I provide the base64-encoded content
and start indexing, everything finishes with no errors thrown, but when
I check or query the index I see that only part of the document has
been indexed. Beyond a certain point everything else is simply missing
and cannot be queried. I have tried multiple PDF and text files (2 to
3 MB each), and I get the same effect every time. It seems that no more
than 4000 terms get indexed, and I wonder why. I set logging to debug
and I see no warnings or errors at all while indexing. I guess I am
doing something wrong, and I hope someone out there has a clue what it
could be. It would be much appreciated.
My code basically looks like this:
1830040’s gists · GitHub
And one of the failing test documents would be this pdf one:
Redirect Notice
Any suggestions on how to get it right?