Getting the extracted content from the attachment mapper plugin

Hi there,

I just played around with the attachment mapper plugin and wondered
whether there is any way to access the parsedContent (as in
AttachmentMapper.java:309), which contains the Tika-parsed content of
the document. When doing a simple GET on the document I only see the
base64-encoded value which I pushed.

I'd like to do some special text extraction on my documents (like
searching for dates in them) after indexing. Alternatively, I could
call the Tika code a second time in my own application, but that seems
a bit dirty.

Any hints appreciated. Possibly I just overlooked something when
skimming through the source and it is totally easy :slight_smile:

--Alexander

--

I also tried to play with it some time ago but did not succeed with MIME type autodetection. :frowning:

I posted something about it here, without getting an answer: https://groups.google.com/forum/m/?fromgroups#!search/Tika$20Pilato$20content$20type/elasticsearch/Ne5_uOKlAAk

So if someone answers Alexander, it would be nice if they also had a look at my old post. :-/

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

I noticed this statement in the attachment mapper plugin too, because I was
interested in handling binary content with ES.

How do you want the binary content to be exposed? At shard level? Or via
node transport?

The REST API uses JSON (XContentBuilder), which is the reason for base64.
With WebSockets, I can imagine exposing binary streams directly to a
Java client, but there is still a lot to do when transporting huge data from
the shard level to the requesting node without blowing up the heap (e.g.
chunked streams).

Jörg

--

Did you solve this somehow?
Is it at all possible to access the parsed plain text via the REST API?

@Jörg Prante: Why binary? Why not plain text?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Yes. Just store the fields you need and ask for them when searching.
Have a look at this gist: "Testing FSRiver with Mapper attachment and check metadata extracted" (GitHub)

It gives some clues (using FSRiver but you can run the same test without FSRiver)
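
To illustrate the "store fields and ask for them" idea, here is a minimal Python sketch that only builds the two JSON bodies involved (a mapping and a search request); it does not talk to a cluster. The type name `doc`, the attachment field name `file`, the sub-field `title`, and the field paths passed to the search are all assumptions for illustration. The mapping layout follows the style of the plugin's README from that period, and the top-level `fields` parameter is assumed to be how stored fields are requested in the Elasticsearch version in use, so check both against your plugin and server versions.

```python
import json
from typing import Dict, List


def build_mapping(type_name: str, attachment_field: str, stored: List[str]) -> Dict:
    """Build a mapping that marks the given attachment sub-fields as stored.

    Assumption: sub-fields are declared under "fields" with {"store": "yes"},
    as in the attachment mapper README of that era.
    """
    fields = {name: {"store": "yes"} for name in stored}
    return {
        type_name: {
            "properties": {
                attachment_field: {"type": "attachment", "fields": fields}
            }
        }
    }


def build_search(query: str, field_paths: List[str]) -> Dict:
    """Build a search body that asks for stored fields by path.

    Assumption: the caller knows the exact field paths for its plugin version.
    """
    return {
        "fields": list(field_paths),
        "query": {"query_string": {"query": query}},
    }


if __name__ == "__main__":
    # Hypothetical names: type "doc", attachment field "file", sub-field "title".
    print(json.dumps(build_mapping("doc", "file", ["file", "title"]), indent=2))
    print(json.dumps(build_search("invoice", ["file", "file.title"]), indent=2))
```

The first body would be sent when creating the mapping and the second as the search request; the hits should then carry the stored values rather than only the base64 source.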

HTH

David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

--

Hi,
thanks David, that helped a little.
It looks, though, like the file is stored as a base64 version of the original
file, and text extraction by Tika is done during indexing AND also during
retrieval, am I right?
I was hoping to store just the plain text (without the original document in any
format), but I guess I'll have to extract the text separately using Tika and
then check and store just the result.
Anyway, thanks for making it clearer for me.
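
A rough Python sketch of that "extract separately, store only the result" approach, under stated assumptions: `extract_text` is a caller-supplied function (for instance a wrapper around Tika, not shown here), `fake_extractor` is a hypothetical stand-in so the example runs on its own, and the document field names `content` and `dates` are made up for illustration. The date search from the original question is reduced to ISO-style dates only.

```python
import json
import re
from typing import Callable, Dict, List

# Matches ISO-style dates such as 2012-11-02; other formats need more patterns.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_dates(text: str) -> List[str]:
    """Return all ISO-style date strings found in the text, in order."""
    return DATE_PATTERN.findall(text)


def build_plaintext_doc(raw: bytes, extract_text: Callable[[bytes], str]) -> Dict[str, object]:
    """Build a document holding only the extracted text and the dates found in it.

    The original binary is not included, so nothing base64-encoded is stored.
    """
    text = extract_text(raw)
    return {"content": text, "dates": find_dates(text)}


def fake_extractor(raw: bytes) -> str:
    """Hypothetical stand-in for a real extractor: treats the bytes as UTF-8 text."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    doc = build_plaintext_doc(b"Invoice issued 2012-11-02, due 2012-12-01.", fake_extractor)
    print(json.dumps(doc, indent=2))
```

The resulting dictionary could be indexed as an ordinary JSON document, which keeps the extraction step in the application instead of in the plugin.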
