Attachment Plugin

I am using the attachment plugin to index HTML documents crawled by
ManifoldCF. I don't think I have any control over the document that
ManifoldCF submits to Elasticsearch.

Is there any way to define/extend the Attachment plugin to do the following:

- Store the Base64-decoded content in a field
- Store the Base64-decoded content, with the markup stripped out, in
  another field

Extract the following elements from the Base64-decoded content, which are
"<meta http-equiv..." fields in the HTML document:

keywords
description
Last-Modified
title

I have searched both the ES documentation site and Google fairly
extensively, and have not turned up anything yet.

I may look into extending either the attachment plugin or the ManifoldCF
Elasticsearch connector to handle this, but I am hoping there is some
mechanism already built into one of them that does it.

Thanks,

--mike

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

AFAIK there used to be a basic tutorial on ES.org about the attachments
plugin. You can still find it in the commit history:
https://github.com/elasticsearch/elasticsearch.github.com/blob/4cb4cd8ae5cee812350d5ccd3664ec5bcc1943a3/tutorials/_posts/2011-07-18-attachment-type-in-action.textile

I am not sure whether there is an official tutorial now.

We use the attachments plugin to parse mail attachments (usually PDF, Word,
Excel or Pages files, rarely HTML). However, I do not think the plugin has
any option for storing the decoded content in custom fields. If I were you,
I would consider parsing the HTML files yourself before you push them into
ES. You would have full control over the parsing, and you could easily
avoid the situation where Apache Tika (the library the plugin uses for
parsing under the hood) fails to extract the content. Parsing the documents
in a separate JVM, rather than on the ES node, might also be a more
GC-friendly approach.
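
To make the "parse it yourself" idea a bit more concrete, here is a rough
sketch in Python using only the standard library. It assumes the payload is
Base64-encoded UTF-8 HTML. PageExtractor, build_es_document and the output
field names (raw_html, text, last_modified and so on) are names I made up
for illustration; they are not part of the attachment plugin or of
ManifoldCF. It only builds the document, it does not send anything to ES.

```python
import base64
from html.parser import HTMLParser

# Meta keys requested in the original question (compared in lowercase).
WANTED_META = {"keywords", "description", "last-modified"}


class PageExtractor(HTMLParser):
    """Collects <title>, selected <meta> values, and the visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta = {}
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            # Accept both <meta name=...> and <meta http-equiv=...>.
            key = (attrs.get("name") or attrs.get("http-equiv") or "").lower()
            if key in WANTED_META and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


def build_es_document(b64_payload, encoding="utf-8"):
    """Decode a Base64 HTML payload and return a dict ready for indexing.

    The field names are hypothetical; rename them to match your mapping.
    """
    html = base64.b64decode(b64_payload).decode(encoding, errors="replace")
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "raw_html": html,  # decoded content, markup intact
        # decoded content with markup stripped and whitespace collapsed
        "text": " ".join(" ".join(parser.text_parts).split()),
        "title": "".join(parser.title_parts).strip(),
        "keywords": parser.meta.get("keywords"),
        "description": parser.meta.get("description"),
        "last_modified": parser.meta.get("last-modified"),
    }


if __name__ == "__main__":
    sample = (
        "<html><head><title>Hello</title>"
        '<meta http-equiv="keywords" content="es, tika">'
        "</head><body><p>Some <b>bold</b> text.</p></body></html>"
    )
    doc = build_es_document(base64.b64encode(sample.encode("utf-8")))
    print(doc["title"], "|", doc["keywords"], "|", doc["text"])
```

Something like this would have to run between the crawler and ES, for
example inside a modified connector, which is exactly the part I cannot
speak to for ManifoldCF. Note that html.parser is lenient but not a full
HTML5 parser, so badly broken pages may need a more robust library.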

Just my 2 cents.

Regards,
Lukas

On Sun, Apr 7, 2013 at 9:34 PM, mjk mj.kelleher@gmail.com wrote:

