ManifoldCF and Elasticsearch - content base64 encoded?

Steve_Corey · December 1, 2015, 3:20pm

I’m putting together a proof of concept for crawling our website content with MCF and indexing it with ES. At a basic level, everything seems to be working. What I’m trying to understand is this: when MCF indexes web content, the HTML is stored inside an object called file, in a property called _content. When this is added to the ES index, all of the HTML is Base64 encoded. I believe this is preventing ES from properly searching the field.
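One quick way to confirm what is actually stored is to decode the value by hand. This is a minimal sketch in Python, assuming the indexed document has the shape described above (a file object with a _content property). The sample HTML, the doc dictionary, and the decode_content helper are made up for illustration and do not come from MCF or ES.

```python
import base64

# Hypothetical stand-in for the _source of an indexed document:
# a "file" object whose "_content" holds Base64-encoded HTML.
sample_html = "<html><body><h1>Hello</h1></body></html>"
doc = {
    "file": {
        "_content": base64.b64encode(sample_html.encode("utf-8")).decode("ascii")
    }
}

def decode_content(source):
    """Decode file._content from a document source dict back to text."""
    encoded = source["file"]["_content"]
    return base64.b64decode(encoded).decode("utf-8")

decoded = decode_content(doc)
print(doc["file"]["_content"])  # the opaque Base64 string as stored
print(decoded)                  # the original HTML
```

If the decoded text is the page HTML, the crawl itself is fine and the stored value is simply the raw Base64 string, which a plain string field would index as one opaque token rather than as searchable words.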

Is this Base64 encoding to be expected, or do I need to change something?

Does anyone have a walkthrough of using MCF to crawl web content and output it to ES? I’ve seen many guides for each system, but never one that combines the two. I’d prefer to avoid using Nutch for crawling, since it lacks any UI for management.

dadoonet · December 1, 2015, 3:38pm

I have no idea, but it sounds to me like they expect you to have the mapper attachments plugin installed and the field defined as the attachment type.
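For reference, a mapping along those lines might look like the sketch below, written as Python dictionaries so the JSON bodies are easy to inspect. It assumes the mapper attachments plugin is installed and that the field is named file, as in the question. The type name "document" and the build_document helper are hypothetical, and none of this has been checked against what the MCF output connector actually sends.

```python
import base64
import json

# Index creation body: declare "file" as an attachment field so the
# plugin decodes the Base64 payload and extracts text for searching.
# "document" is a hypothetical mapping type name.
INDEX_BODY = {
    "mappings": {
        "document": {
            "properties": {
                "file": {"type": "attachment"}
            }
        }
    }
}

def build_document(html):
    """Build a document whose file._content is the Base64-encoded HTML."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"file": {"_content": encoded}}

print(json.dumps(INDEX_BODY, indent=2))
print(json.dumps(build_document("<p>hi</p>"), indent=2))
```

With a mapping like this in place before indexing, the Base64 encoding would be expected input rather than a problem, since the attachment type takes encoded content and indexes the extracted text.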

ClaudiuStack · February 15, 2017, 11:31am

I have the same problem. Did you find any solution to fix this?
