I am using the attachment plugin to index HTML documents that are crawled
from ManifoldCF. I don't think I have any control over the document that
is submitted from ManifoldCF to Elasticsearch.
Is there any way to configure or extend the attachment plugin to do the
following?
1. Store the Base64-decoded content in a field.
2. Store the Base64-decoded content, with the markup stripped out, in
   another field.
3. Extract the following elements from the Base64-decoded content, which
   are "<meta http-equiv..." fields in the HTML document:
     keywords
     description
     Last-Modified
     title
I have searched both the ES documentation site and Google fairly
extensively and have not turned up anything yet.
I may look into extending either the attachment plugin or the ManifoldCF
Elasticsearch connector to handle this, but I am hoping one of them
already has a built-in mechanism for it.
We use the attachment plugin to parse mail attachments (usually PDF, Word,
Excel or Pages files, rarely HTML). However, I do not think the plugin
offers any way to store the decoded content in custom fields. If I were
you, I would consider parsing the HTML files yourself before you push them
into ES. You would have full control over parsing, and you could easily
avoid situations where Apache Tika (the library used for parsing under the
hood) fails to extract the content. Parsing the documents in a separate
JVM (not the ES node) might also be a more GC-friendly approach.
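To illustrate what "parsing the HTML yourself" could look like, here is a
minimal sketch in Python using only the standard library. It is not part of
the attachment plugin or of ManifoldCF. The names PageExtractor and
build_document, and the field names raw_html and text, are made up for this
example, and it assumes the payload is Base64-encoded UTF-8 HTML. It returns
a plain dict that you could then index with whatever client you use.

```python
import base64
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, selected <meta> values, and visible text."""

    # Meta keys to keep (hypothetical selection, compared in lower case).
    WANTED = {"keywords", "description", "last-modified"}
    # Elements whose text content should not count as visible text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta = {}
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Accept both <meta name=...> and <meta http-equiv=...>.
            key = (attrs.get("name") or attrs.get("http-equiv") or "").lower()
            if key in self.WANTED and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def build_document(b64_payload, encoding="utf-8"):
    """Decode a Base64 HTML payload and return a dict of fields to index."""
    html = base64.b64decode(b64_payload).decode(encoding, errors="replace")
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    doc = {
        "raw_html": html,                         # decoded content as-is
        "text": " ".join(parser.text_parts),      # content with markup stripped
        "title": "".join(parser.title_parts).strip(),
    }
    doc.update(parser.meta)  # keywords, description, last-modified if present
    return doc
```

The standard-library parser is lenient but simple. For badly broken HTML you
may want a more forgiving parser, and the set of meta keys would need to be
adjusted to whatever your pages actually contain.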