Indexing PDF and other binary formats

IronMike · January 16, 2014, 7:42pm

Is there any literature on how to index PDF documents and binary formats
like images?

Versioning question: if I update an already indexed document, I believe
ES will increment the version number. Does it keep the previous
document? What if I need access to the previous version?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a9e8f331-c4bd-4a4c-be5a-b91e4f2f0e26%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · January 16, 2014, 9:12pm

You can use Tika yourself (recommended). See how I did it in the fsriver project.
You can also use the mapper attachment plugin, which uses Tika behind the scenes but gives you less control, IMHO.
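
A minimal sketch of the first option, assuming a Tika server is running locally (9998 is its default port) and Elasticsearch is listening on localhost:9200. The index name `docs`, the type `doc`, and the field names `filename` and `content` are made up for illustration, and none of this is taken from the fsriver project.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"          # assumption: local Tika server
ES_DOC_URL = "http://localhost:9200/docs/doc/"   # assumption: index "docs", type "doc"


def tika_request(raw_bytes):
    """Build a PUT request asking Tika for the plain text of a file."""
    return urllib.request.Request(
        TIKA_URL, data=raw_bytes, method="PUT",
        headers={"Accept": "text/plain"},
    )


def index_request(doc_id, filename, text):
    """Build a PUT request that indexes only the extracted text."""
    body = json.dumps({"filename": filename, "content": text}).encode("utf-8")
    return urllib.request.Request(
        ES_DOC_URL + doc_id, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(doc_id, path):
    """Extract text with Tika, then send it to Elasticsearch."""
    with open(path, "rb") as f:
        raw = f.read()
    with urllib.request.urlopen(tika_request(raw)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(index_request(doc_id, path, text)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `extract_and_index("1", "report.pdf")` would send the file to Tika and index the returned text. The binary itself never reaches Elasticsearch, which is where the extra control comes from.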

About versions: Elasticsearch does not keep old versions around. If you need them, you have to manage that yourself.
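
One possible way to manage it yourself, sketched under assumptions: before overwriting a document, fetch the current copy and store it in a separate history index under an id that combines the document id and its version. The helper name `history_entry` and the field names in the history document are hypothetical; only `_id`, `_version`, and `_source` come from the GET document response.

```python
def history_entry(current):
    """Turn a GET-document response into an (id, body) pair for a history index."""
    history_id = "{}_v{}".format(current["_id"], current["_version"])
    body = {
        "doc_id": current["_id"],
        "version": current["_version"],
        "source": current["_source"],
    }
    return history_id, body


# Shape of a GET response, trimmed to the fields used here.
current = {"_id": "42", "_version": 3, "_source": {"title": "Report"}}
history_id, body = history_entry(current)
print(history_id)  # 42_v3
```

Indexing `body` under `history_id` before each update keeps every earlier version retrievable by document id and version number.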

HTH

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

IronMike · January 16, 2014, 9:51pm

Thanks for the reply. As I understand it, the attachment plugin encodes the
content before indexing it, which sounds like an expensive operation if we
have lots of PDFs. I was thinking of extracting the text from the PDFs early
on and working with plain text instead.
Does the plugin also work for binaries like images?
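
For a rough sense of the encoding cost: the attachment type takes the file as a base64 string inside the JSON document, and base64 turns every 3 bytes into 4 characters. A quick check, with random bytes standing in for a PDF:

```python
import base64
import os

raw = os.urandom(3_000_000)          # stand-in for a 3 MB PDF
encoded = base64.b64encode(raw)      # what would be sent inside the JSON body

print(len(raw))       # 3000000
print(len(encoded))   # 4000000
```

So each request carries about a third more data than the file itself, on top of the extraction work done on the Elasticsearch node.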

dadoonet · January 16, 2014, 9:55pm

Yes. Some metadata is extracted with Tika.

As you said, you should do that extraction before indexing (that is, only index what you really need).
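
A small sketch of what "only index what you really need" could look like once Tika has returned text and metadata. The metadata keys shown are typical of what Tika reports for a JPEG, but exact keys vary by format and Tika version, so treat them and the target field names as assumptions.

```python
# Assumption: which Tika metadata keys to keep, and the field names to index them under.
FIELD_MAP = {
    "Content-Type": "content_type",
    "tiff:ImageWidth": "width",
    "tiff:ImageLength": "height",
}


def slim_document(text, metadata, field_map=FIELD_MAP):
    """Keep the extracted text (if any) and only the mapped metadata keys."""
    doc = {}
    if text and text.strip():
        doc["content"] = text.strip()
    for tika_key, es_field in field_map.items():
        if tika_key in metadata:
            doc[es_field] = metadata[tika_key]
    return doc


# An image has no text, so only the selected metadata ends up in the document.
image_meta = {
    "Content-Type": "image/jpeg",
    "tiff:ImageWidth": "800",
    "tiff:ImageLength": "600",
    "Compression Type": "Baseline",
}
print(slim_document("", image_meta))
# {'content_type': 'image/jpeg', 'width': '800', 'height': '600'}
```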
