I have a bunch of HTML documents that I would like to index (around
3000, so not that many). I put the title and some other metadata
in separate properties, but I would like to make the content searchable
as well, and I would also like to be able to display the original
document. I would like to do all of this over JSON. But:
"JSON does not look like XML, so HTML text fed to a JSON parser will
produce an error."
So I'm having problems parsing my hits back.
How do you guys solve this? Do you strip the HTML out of the
document, index only the plain-text content, and then pull the
original from another database (based on an indexed id), or are there
other ways?
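To make the "strip the HTML and index only the plain text" option concrete, here is a minimal sketch in Python using only the standard library. The field names (`id`, `title`, `content`, `html`) and the helper `build_document` are made up for illustration, not anything ES requires. It also shows that the raw HTML can travel inside a JSON string as long as a real JSON encoder does the escaping.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def build_document(doc_id, title, html):
    """Hypothetical helper: plain text for searching, raw HTML for display."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"id": doc_id, "title": title, "content": parser.text(), "html": html}


sample_html = (
    "<html><head><title>Hi</title><style>p {color: red}</style></head>"
    '<body><p>Hello "world" &amp; friends</p></body></html>'
)
doc = build_document("doc-1", "Hi", sample_html)

# json.dumps escapes the quotes inside the HTML, so the payload is valid JSON.
payload = json.dumps(doc)
```

Whether to keep the raw HTML in the same document or in a separate store keyed by id is a design choice; the sketch keeps both in one document only to keep it short.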
You should check the attachment type: http://www.elasticsearch.com/docs/elasticsearch/mapping/attachment/
Note that as of 0.12.0 the plugin needs to be installed in extracted form.
The best option is to use the bin/plugin script to install plugins. You can
also do it manually: create a "plugins" directory in ES HOME, unpack
the particular plugin into this folder, and start ES.
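As a rough sketch of what indexing through the attachment type could look like from Python: the attachment field is sent as a base64-encoded string, so the HTML never has to be escaped by hand. The type name `page`, the field name `file`, and the helper `build_attachment_document` are assumptions for illustration; check the linked docs for the exact mapping options in your version.

```python
import base64
import json

# Hypothetical mapping: a "page" type whose "file" field is an attachment.
mapping = {"page": {"properties": {"file": {"type": "attachment"}}}}


def build_attachment_document(title, html):
    """Hypothetical helper: base64-encode the HTML for the attachment field."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"title": title, "file": encoded}


html = '<html><body><p class="intro">Hello "world"</p></body></html>'
doc = build_attachment_document("Hello", html)

# Both bodies are plain JSON; base64 output contains no characters that need escaping.
mapping_body = json.dumps(mapping)
doc_body = json.dumps(doc)
```

To display the original, decode the stored field again with `base64.b64decode(...)`.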
On Wed, 2010-10-27 at 10:05 +0200, Albin Stigo wrote:
Ok!
I'm using node.js, which as you know runs on V8, and
JSON.parse(str) actually throws an error (on some hits) when trying
to parse the response. But of course I managed to put the documents into ES
using JSON in the first place, using a Python script with the json package,
and that escaped everything fine.
Request the doc that is throwing the error in node.js using curl from
the command line. Have a look at what is in the _source field, and try to
decode it from JSON using both Python and node.js.
It's more likely that there was a bug putting the _source INTO ES than
the other way around.
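The Python half of that check might look roughly like this. It is only a sketch: `check_source` is a made-up helper, and the two sample strings stand in for whatever curl actually returns for the failing doc.

```python
import json


def check_source(raw_response):
    """Try to decode a raw response body; return (ok, _source or error text)."""
    try:
        hit = json.loads(raw_response)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return False, str(exc)
    return True, hit.get("_source")


# A response whose HTML was escaped by a JSON encoder decodes cleanly.
good = json.dumps({"_id": "1", "_source": {"html": '<a href="x">link</a>'}})

# HTML pasted into the body without escaping the inner quotes is not valid JSON.
bad = '{"_id": "1", "_source": {"html": "<a href="x">link</a>"}}'

good_ok, good_source = check_source(good)
bad_ok, bad_detail = check_source(bad)
```

If the same raw body decodes in Python but not in node.js (or the other way around), the error message and position from each parser should help narrow down which characters are the problem.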