Document indexing


(tullio0106) #1

I'd like to use Elasticsearch to index my documents.
I have documents such as .doc or .pdf files, among others.
Is there a way or tool to index this kind of document?
Tks
Tullio


(Rafał Kuć) #2

Hello!

Take a look at the Apache Tika framework (http://tika.apache.org/). You can
extract text from files such as PDF or DOC and then index that text into
ElasticSearch.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
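
A rough sketch of the extract-then-index flow suggested above, in Python. Tika itself runs on the JVM, so the extraction step is passed in as a function; `extract_text`, `fake_extractor`, and the field names `filename` and `content` are hypothetical names chosen for illustration, not part of Tika or ElasticSearch.

```python
import json
from typing import Callable


def build_document(path: str, extract_text: Callable[[str], str]) -> str:
    """Return a JSON body holding the text extracted from one file.

    `extract_text` stands in for whatever wraps Tika in your setup
    (assumption: it takes a path and returns plain text).
    """
    text = extract_text(path)
    # Field names are illustrative; use whatever your mapping defines.
    return json.dumps({"filename": path, "content": text})


def fake_extractor(path: str) -> str:
    # Placeholder so the sketch runs without Tika installed.
    return "plain text extracted from " + path


body = build_document("report.pdf", fake_extractor)
print(body)
```

The resulting string is what would be sent as the body of an index request; only plain text reaches ElasticSearch, so the binary file never has to be indexed.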


(David Pilato) #3

You could also use the attachment plugin, which does the Tika work for you.

David :wink:
Twitter : @dadoonet / @elasticsearchfr
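
A minimal sketch of how the attachment plugin is typically fed, assuming the plugin's `attachment` field type and its convention of passing the file as a base64 string (check this against the README of the version you install). The type name `document` and the field name `file` are illustrative choices.

```python
import base64
import json

# Mapping that declares one field of type "attachment".
# "document" and "file" are hypothetical names for this example.
mapping = {"document": {"properties": {"file": {"type": "attachment"}}}}


def build_attachment_doc(raw: bytes) -> str:
    """Base64-encode the raw file bytes and wrap them in a JSON document."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"file": encoded})


doc = build_attachment_doc(b"%PDF-1.4 example bytes")
print(json.dumps(mapping))
print(doc)
```

With this approach the text extraction happens inside the node at index time, so the client only has to encode the file.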


(tullio0106) #4

Where can I find the attachment plugin?
Tks
Tullio


(David Pilato) #5


https://github.com/elasticsearch/elasticsearch-mapper-attachments

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet


(Andrew[.:at:.]DataFeedFile.com) #6

If you do not want to use Tika, you could probably also base64-encode
the file and store it in one large string field yourself.
I have never tried it, but I think it should work.

--Andrew
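
An untested idea like this can at least be checked for round-tripping. The sketch below, using only Python's standard library, encodes some stand-in bytes into a JSON string field and decodes them again; the field name `data` and the sample bytes are made up for the example.

```python
import base64
import json


def encode_file_field(raw: bytes) -> str:
    """Return the base64 text form of the raw file bytes."""
    return base64.b64encode(raw).decode("ascii")


# 3072 arbitrary bytes standing in for a binary file.
raw = bytes(range(256)) * 12
encoded = encode_file_field(raw)
doc = json.dumps({"data": encoded})

# Decoding the stored field gives back the original bytes.
restored = base64.b64decode(json.loads(doc)["data"])
print(restored == raw)
print(len(raw), len(encoded))
```

Base64 emits four characters for every three input bytes, so the stored field is about a third larger than the file.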


(tullio0106) #7

Nothing against Tika, but it's quite slow (I tried it with a 3 MB PDF file and
extraction took 4 minutes).
Base64 encoding doesn't seem like a good idea to me, because every string would
be indexed, including escape sequences and meaningless strings.
I was hoping for a built-in Elasticsearch tool that avoids such complexity.
Tks
Tullio
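
The concern about base64 can be illustrated with a small sketch. The `naive_terms` function is a hypothetical, very rough stand-in for an analyzer (lowercase, split on whitespace); it is not how Elasticsearch analyzes text, but it shows why the encoded form of a text carries none of the original words.

```python
import base64


def naive_terms(value: str) -> set:
    # Crude stand-in for an analyzer: lowercase and split on whitespace.
    return set(value.lower().split())


text = "hello world"
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")

print(naive_terms(text))
print(naive_terms(encoded))
print("hello" in naive_terms(encoded))
```

If the base64 string were indexed as ordinary text, a search for "hello" would not match, which is why the text has to be extracted before indexing, whether by Tika on the client or by a plugin on the server.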


(David Pilato) #8

With the ES attachment plugin, I indexed more than 100 documents per second on
a "small cluster" (2 nodes, 8 GB RAM).
The documents were PDF, OpenOffice.org (oOo), JPEG, ...

So, may I suggest you give it a try?

David.


(tullio0106) #9

Is there a Maven repository for the attachment mapper?
Where can I find it?
Tks
Tullio


(Shay Banon) #10

It's under the same Maven repo as the main elasticsearch jar files.

