Is there any literature on how to index PDF documents and binary formats
like images?
Versioning question: if I update an already indexed document, I believe
ES will increment the version number. Does it keep the previous document?
What if I need access to the previous document?
You can use Tika yourself (recommended). See how I did it in the fsriver project.
You can use the mapper attachment plugin, which uses Tika behind the scenes but gives you less control IMHO.
About versions: Elasticsearch does not keep old versions around. If you need that, you have to manage it yourself.
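To illustrate what "manage it yourself" could look like, here is a minimal sketch of one possible approach: archive the outgoing revision under its own key before overwriting the current document. Everything in it is an assumption for illustration. `VersionedStore` is a hypothetical in-memory stand-in for a "current" index plus a separate "history" index, it makes no Elasticsearch client calls, and its version counter only mimics the idea of a version number that increases on each update.

```python
import copy


class VersionedStore:
    """Hypothetical in-memory stand-in for a current index plus a history index."""

    def __init__(self):
        self.current = {}  # doc_id -> {"version": int, "source": dict}
        self.history = {}  # (doc_id, version) -> source dict of a superseded revision

    def index(self, doc_id, source):
        """Store a new revision, archiving the previous one first. Returns the new version."""
        previous = self.current.get(doc_id)
        if previous is not None:
            # Keep the outgoing revision before it is overwritten.
            self.history[(doc_id, previous["version"])] = previous["source"]
            version = previous["version"] + 1
        else:
            version = 1
        self.current[doc_id] = {"version": version, "source": copy.deepcopy(source)}
        return version

    def get(self, doc_id, version=None):
        """Return the latest revision, or an archived one when a version is given."""
        entry = self.current[doc_id]
        if version is None or version == entry["version"]:
            return entry["source"]
        return self.history[(doc_id, version)]


store = VersionedStore()
store.index("report-1", {"title": "Draft"})
store.index("report-1", {"title": "Final"})
print(store.get("report-1"))             # latest revision
print(store.get("report-1", version=1))  # archived revision
```

With a real cluster, the same idea would presumably mean reading the existing document and writing it to a second index (for example keyed by document ID plus version) before indexing the update.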
HTH
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
Thanks for the reply. As I understand it, the attachment plugin encodes
content before indexing it, which sounds like an expensive operation if we
have lots of PDFs. I was thinking of extracting the text from the PDFs early
on and dealing with text instead.
Does the plugin also work for binaries like images?
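A rough way to see the trade-off described above is to compare the two request bodies. The sketch below is illustrative only: the field names `file` and `content` are hypothetical, the byte string is a stand-in rather than a real PDF, and the "extracted" text is hard-coded where a real pipeline would call an extractor such as Tika. It shows that a base64 payload is about a third larger than the raw file, while a text-only document carries just the extracted text.

```python
import base64
import json


def attachment_style_doc(raw_bytes):
    """Body that carries the whole file as base64 (hypothetical field name 'file')."""
    return {"file": base64.b64encode(raw_bytes).decode("ascii")}


def text_only_doc(extracted_text):
    """Body that carries only text extracted before indexing (hypothetical field name)."""
    return {"content": extracted_text}


# Stand-in bytes, NOT a real PDF; a real file would be read from disk.
raw = b"%PDF-1.4 " + bytes(range(256)) * 40
# Pretend output of a text extractor run on the client side.
text = "Quarterly report: revenue grew in all regions."

encoded_body = json.dumps(attachment_style_doc(raw))
plain_body = json.dumps(text_only_doc(text))

print("raw file bytes:     ", len(raw))
print("base64 request body:", len(encoded_body))
print("text-only body:     ", len(plain_body))
```

Extracting early also means the extraction cost is paid once on the client, and the original binary can be stored elsewhere if it is still needed.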
On Thursday, January 16, 2014 4:12:47 PM UTC-5, David Pilato wrote:
You can use Tika yourself (recommended). See how I did it in the fsriver
project.
You can use the mapper attachment plugin, which uses Tika behind the scenes
but gives you less control IMHO.
About versions: Elasticsearch does not keep old versions around. If you
need that, you have to manage it yourself.
HTH
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
On 16 Jan 2014, at 20:42, ZenMaster80 <sabda...@gmail.com> wrote:
Is there any literature on how to index PDF documents and binary formats
like images?
Versioning question: if I update an already indexed document, I believe
ES will increment the version number. Does it keep the previous document?
What if I need access to the previous document?