Indexing HTML documents, problems with JSON

Hi,

I have a bunch of HTML documents that I would like to index (around
3000, so not that many). I put the title as well as some other metadata
in separate properties, but I would also like to make the content
searchable, and to be able to display the original document... and I
would like to do all of this over JSON. But:

"JSON does not look like XML, so HTML text fed to a JSON parser will
produce an error."

So I'm having problems parsing my hits back...

How do you guys solve this? Do you strip the HTML out of the document,
index only the plain-text content, and then pull the original from
another database (based on an indexed id), or are there other ways?

Sorry for the rather long post.

--Albin

Hi,

you should check the attachment type:
http://www.elasticsearch.com/docs/elasticsearch/mapping/attachment/
Note that as of 0.12.0 the plugin needs to be installed in extracted form.
The best option is to use the bin/plugin script to install plugins (or you
can do it manually: create a "plugins" directory in the ES home directory,
unpack the particular plugin into that folder, and you are done... start ES).
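For what it's worth, here is a rough, untested sketch of how the attachment
route might look from Python. The index, type, field names and file are just
placeholders, and I am going from the plugin docs, so the exact mapping
syntax may differ:

import base64
import json
import urllib.request

def put(url, payload):
    # small helper to PUT a JSON body to ES
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req).read().decode("utf-8")

# map one field as an attachment, next to the ordinary metadata fields
# (assumes a local node and that the "pages" index already exists)
print(put("http://localhost:9200/pages/page/_mapping", {
    "page": {
        "properties": {
            "title": {"type": "string"},
            "file": {"type": "attachment"},
        }
    }
}))

# the attachment field takes the document base64-encoded
html = open("doc0001.html", "rb").read()
print(put("http://localhost:9200/pages/page/1", {
    "title": "Document 1",
    "file": base64.b64encode(html).decode("ascii"),
}))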

But this way you will not get the raw HTML in _source (it will be kept in
base64 form). So you can either try to decode it from the result hits on the
client side, or extract the raw HTML before indexing, escape it to make it
valid JSON (shouldn't be that hard), and use the html_strip char filter (see
http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/charfilter/
and, for more details, elasticsearch issue #315 on GitHub, which adds
char_filter support along with an html_strip char filter and a
standard_html_strip analyzer). However, I have not tried it myself yet.
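For illustration, roughly what that index setup could look like. This is an
untested sketch with placeholder analyzer, type, and field names; it just
prints the body you would send when creating the index:

import json

# custom analyzer that strips HTML tags at analysis time; _source keeps the
# escaped raw HTML, so you can still display the original document
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "html_content": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "char_filter": ["html_strip"],
                    }
                }
            }
        }
    },
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "string"},
                "content": {"type": "string", "analyzer": "html_content"},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))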

Regards,
Lukas


Regarding feeding the HTML within the JSON itself: I'm not sure how you
generate the JSON, but most "to_json" utilities/libraries also escape the
relevant characters.
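For example, a tiny sketch with Python's json module (field names made up):
the quotes and the newline inside the HTML get escaped, and the result
parses back without complaint:

import json

html = '<p class="lead">She said "hello" &amp; left.\n<b>Next line.</b></p>'
doc = {"title": "Example page", "content": html}

body = json.dumps(doc)   # escapes the inner quotes and the newline
print(body)

assert json.loads(body)["content"] == html   # round-trips cleanly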


Ok!

I'm using node.js, which as you know runs on V8, and JSON.parse(str)
actually throws an error (on some hits) when trying to parse the JSON back.
But of course I managed to put it into ES as JSON in the first place, using
a Python script with the json package, and that escaped everything fine.

Maybe it's a bug in node.js..

--Albin

On Wed, Oct 27, 2010 at 12:10 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Regarding feeding the HTML within the JSON itself: I'm not sure how you
generate the JSON, but most "to_json" utilities/libraries also escape the
relevant characters.


On Wed, 2010-10-27 at 10:05 +0200, Albin Stigo wrote:

Ok!

I'm using node.js, which as you know runs on V8, and JSON.parse(str)
actually throws an error (on some hits) when trying to parse the JSON back.
But of course I managed to put it into ES as JSON in the first place, using
a Python script with the json package, and that escaped everything fine.

Request the doc that is throwing the error in node.js using curl from
the command line. Have a look at what is in the _source field and try to
decode it from JSON using both Python and node.js.
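For the Python side, something along these lines (just a sketch; the host,
index, type, id and field names are placeholders for wherever the problem
doc lives):

import json
import urllib.request

# fetch the offending document straight from ES
url = "http://localhost:9200/pages/page/THE_BAD_ID"
raw = urllib.request.urlopen(url).read().decode("utf-8")

hit = json.loads(raw)                    # does Python accept the response?
print(hit["_source"]["content"][:200])   # eyeball the stored HTML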

It's more likely that there was a bug putting the _source INTO ES than
the other way around.

clint