Indexing a large PDF document

Hi,

I'm trying to index a big document with ES and the Mapper Attachment plugin
(https://github.com/elastic/elasticsearch-mapper-attachments). The document has
719 pages, but after indexing I can only search phrases up to page 33. When
I index the document I base64-encode the file contents, and the file gets
added to the index successfully. Is there a limit on the size of the
file?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/04dd35e4-1caf-4a30-8f24-13cf47907067%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

There is a limit on the number of extracted characters.

See https://github.com/elastic/elasticsearch-mapper-attachments#indexed-characters

--
David Pilato - Developer | Evangelist
elastic.co
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs
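
For anyone landing here with the same symptom, here is a minimal sketch of a per-document override, written in Python only to build the request body. It assumes what the linked README section describes: that the plugin stops extracting after a default number of characters (100000, as far as I recall), that a `_indexed_chars` field can be sent alongside `_content`, and that -1 means "extract everything". The field name `my_attachment` and the placeholder bytes are hypothetical; please check the README for the plugin version you run.

```python
import base64
import json


def build_attachment_doc(raw_bytes, indexed_chars=-1, field="my_attachment"):
    """Build the JSON body for one attachment document.

    Assumption: `field` is mapped with type "attachment", and the plugin
    honours a per-document `_indexed_chars` value (-1 = no limit).
    """
    return {
        field: {
            # The plugin expects the file contents base64-encoded.
            "_content": base64.b64encode(raw_bytes).decode("ascii"),
            # Lift the extraction limit for this document only.
            "_indexed_chars": indexed_chars,
        }
    }


# Placeholder bytes stand in for the real PDF contents.
body = build_attachment_doc(b"%PDF-1.4 placeholder bytes")
print(json.dumps(body, indent=2))
```

Note that extracting all the text of a very large document means that text has to fit in memory on the node, so an unlimited setting may not be the right choice for every file.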

On 26 March 2015 at 10:51, Jakko Sikkar jakko.sikkar@gmail.com wrote:

Hi,

I'm trying to index a big document with ES and the Mapper Attachment plugin (https://github.com/elastic/elasticsearch-mapper-attachments). The document has 719 pages, but after indexing I can only search phrases up to page 33. When I index the document I base64-encode the file contents, and the file gets added to the index successfully. Is there a limit on the size of the file?

Thank you very much for pointing that out. I read the documentation but
somehow skipped that part :)

On Thursday, 26 March 2015 at 12:51:50 UTC+2, David Pilato wrote:

There is a limit on the number of extracted characters.

See
https://github.com/elastic/elasticsearch-mapper-attachments#indexed-characters

Hi,

Can you please share your ES configuration? I want to index many large PDFs, but I'm not sure how to index them together. Do I have to write a separate curl command for each one? I also need help with this syntax:

PUT /test-mapping/person/1
{
  "my_attachment" : {
    "_name" : "/home/ubuntu/test.pdf",
    "_language" : "en",
    "_content" : "... base64 encoded attachment ..."
  }
}

Do I have to write the content even if I know I want to index the complete file? How do I specify the location to read from?

This thread is 10 months old. I think it would be better if you started a new thread for this.

Do I have to write the content even if I know I want to index the complete file? How do I specify the location to read from?

You can't specify the location to read from with the mapper attachments plugin.
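
Since the plugin only sees what is in `_content`, the file has to be read and encoded on the client side. Here is a rough sketch of how that could be scripted in Python instead of writing one curl command per file. The index, type, and field names are taken from the request above; the host `http://localhost:9200`, the helper names, and the choice of sequential document IDs are my own assumptions for illustration, not something the plugin prescribes.

```python
import base64
import json
import urllib.request
from pathlib import Path

ES_URL = "http://localhost:9200"  # assumption: a local node on the default port


def attachment_body(pdf_path, field="my_attachment"):
    """Read one file from disk and build the attachment JSON body."""
    data = Path(pdf_path).read_bytes()
    return {
        field: {
            "_name": str(pdf_path),
            # The whole file is sent, base64-encoded; there is no way to
            # pass only a path and have the plugin read it.
            "_content": base64.b64encode(data).decode("ascii"),
        }
    }


def index_pdf(pdf_path, doc_id, index="test-mapping", doc_type="person"):
    """PUT one document; equivalent to the hand-written request above."""
    payload = json.dumps(attachment_body(pdf_path)).encode("utf-8")
    request = urllib.request.Request(
        f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def index_directory(directory):
    """Index every *.pdf in a directory, using 1, 2, 3, ... as IDs."""
    for doc_id, pdf in enumerate(sorted(Path(directory).glob("*.pdf")), start=1):
        index_pdf(pdf, doc_id)
```

With a node running, something like `index_directory("/home/ubuntu")` would then send one request per PDF. For large files it may also be worth looking at the extracted-characters limit discussed earlier in this thread.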

Hello,
I am new to Elasticsearch, but I like how powerful a tool it is.
Could you please point me to the documentation for indexing even a single PDF file?
Best regards!

Hi,

You can definitely index PDF documents, provided they are base64-encoded:

https://discuss.elastic.co/t/logstash-parsing-for-rich-text-documents/40640/3

Hello,
Thank you so much, I tried it and it works!
Cheers!

Hello again,

I have run into a problem with my Elasticsearch in production: it stops after two days, and I am not able to figure out what I am missing.

Any ideas would be welcome!

Best regards