Indexing PDF and other binary formats


(IronMike) #1
  • Is there any literature on how to index PDF documents and binary formats
    like images?
  • Versioning question: If I update an already indexed document, I believe
    ES will update the version number. Does it keep the previous document?
    What if I need access to the previous document?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9e8f331-c4bd-4a4c-be5a-b91e4f2f0e26%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

You can use Tika yourself (recommended). See how I did it in the fsriver project.
Or you can use the mapper attachment plugin, which uses Tika behind the scenes but gives you less control IMHO.
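
For illustration, a minimal sketch of the "use Tika yourself" route. It assumes a standalone Tika Server running locally on its default port 9998, which is not what this thread or fsriver describes (fsriver calls the Tika Java library directly). The helper names and the `filename`/`content` field names are made up for the example.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: local Tika Server, default port


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for the plain text of a file."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",                      # ask for text, not XHTML
            "Content-Type": "application/octet-stream",  # let Tika detect the real type
        },
    )


def extract_text(data: bytes, url: str = TIKA_URL) -> str:
    """Send the raw bytes to Tika Server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(data, url)) as resp:
        return resp.read().decode("utf-8")


def to_index_body(filename: str, text: str) -> str:
    """JSON body to index: the extracted text only, the binary itself is not sent."""
    return json.dumps({"filename": filename, "content": text.strip()})
```

With a server running, `to_index_body("report.pdf", extract_text(open("report.pdf", "rb").read()))` would give the JSON to send to Elasticsearch; `report.pdf` is a placeholder file name.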

About versions: Elasticsearch does not keep old versions around. If you need them, you have to manage that yourself.
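
One possible way to manage it yourself, sketched as plain request builders: before overwriting a document, copy its current source into a separate history index under an id that embeds the version number. The `docs_history` index name, the `<id>_v<version>` id scheme, and the helper name are assumptions for the example. The `/_doc/` path follows current Elasticsearch releases; in 2014 the path carried a type name instead.

```python
import json


def archive_then_update(index, doc_id, current_version, current_source, new_source,
                        history_index="docs_history"):
    """Return the two (method, path, body) requests to send, in order."""
    history_id = f"{doc_id}_v{current_version}"
    return [
        # 1. keep a copy of the version that is about to be replaced
        ("PUT", f"/{history_index}/_doc/{history_id}", json.dumps(current_source)),
        # 2. overwrite the live document; Elasticsearch increments _version itself
        ("PUT", f"/{index}/_doc/{doc_id}", json.dumps(new_source)),
    ]
```

The sketch only builds the requests. Sending them, and deciding what to do if the second one fails after the first succeeded, is left to the caller.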

HTH

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Jan 2014 at 20:42, ZenMaster80 sabdalla80@gmail.com wrote:



(IronMike) #3

Thanks for the reply. As I understand it, the attachment plugin encodes content
before indexing it, which sounds like an expensive operation if we have lots
of PDFs. I was thinking of extracting the text from each PDF early on and
dealing with plain text instead.
Does the plugin also work for binaries like images?

On Thursday, January 16, 2014 4:12:47 PM UTC-5, David Pilato wrote:



(David Pilato) #4

Yes. Some metadata is extracted with Tika.

As you said, you should do the extraction before indexing (that is, only index what you really need).
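
A small sketch of "only index what you really need", assuming the text and metadata have already been extracted (for example with Tika). The function name and the `content`, `title`, and `author` field names are placeholders; the metadata keys Tika actually returns depend on the file type.

```python
def slim_document(text, metadata, keep=("title", "author")):
    """Keep the extracted text plus a whitelist of metadata fields."""
    doc = {"content": text}
    for key in keep:
        if key in metadata:
            doc[key] = metadata[key]  # copy only fields named in `keep`
    return doc
```

For an image, `text` will typically be empty and only the whitelisted metadata ends up in the indexed document.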

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 16 Jan 2014 at 22:51, ZenMaster80 sabdalla80@gmail.com wrote:


