Indexing HTML documents, problems with JSON

Hi,

I have a bunch of HTML documents that I would like to index (around
3000, so not that many). I put the title as well as some other metadata
in separate properties, but I would also like to make the content
searchable, and to be able to display the original document... and I
would like to do all of this over JSON. But:

"JSON does not look like XML, so HTML text fed to a JSON parser will
produce an error."

So I'm having problems parsing my hits back...

How do you guys solve this? Do you strip the HTML out of the document,
index only the plain-text content, and then pull the original from
another database (based on an indexed id), or are there other ways?

Sorry for the rather long post.

--Albin

Hi,

you should check the attachment type:
http://www.elasticsearch.com/docs/elasticsearch/mapping/attachment/
Note that as of 0.12.0 the plugin needs to be installed in extracted form.
The best option is to use the bin/plugin script to install plugins (or you
can do it manually: create a "plugins" directory in the ES home directory,
unpack the particular plugin into that folder, and you are done... start ES).
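For what it's worth, here is a rough, untested sketch of how the attachment
route might look from Python. The index, type, field names and file are just
placeholders, and I am going from the plugin docs, so the exact mapping
syntax may differ:

import base64
import json
import urllib.request

def put(url, payload):
    # small helper to PUT a JSON body to ES
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req).read().decode("utf-8")

# map one field as an attachment, next to the ordinary metadata fields
# (assumes a local node and that the "pages" index already exists)
print(put("http://localhost:9200/pages/page/_mapping", {
    "page": {
        "properties": {
            "title": {"type": "string"},
            "file": {"type": "attachment"},
        }
    }
}))

# the attachment field takes the document base64-encoded
html = open("doc0001.html", "rb").read()
print(put("http://localhost:9200/pages/page/1", {
    "title": "Document 1",
    "file": base64.b64encode(html).decode("ascii"),
}))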

But this way you will not get the raw HTML in _source (it will be kept in
base64 form). So you can either try to decode it from the result hits on the
client side, or extract the raw HTML before indexing, escape it to make it
valid JSON (shouldn't be that hard), and use the html_strip char filter (see
http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/charfilter/
and, for more details, elasticsearch issue #315 on GitHub, which adds
char_filter support along with an html_strip char filter and a
standard_html_strip analyzer). However, I have not tried it myself yet.
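For illustration, roughly what that index setup could look like. This is an
untested sketch with placeholder analyzer, type, and field names; it just
prints the body you would send when creating the index:

import json

# custom analyzer that strips HTML tags at analysis time; _source keeps the
# escaped raw HTML, so you can still display the original document
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "html_content": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "char_filter": ["html_strip"],
                    }
                }
            }
        }
    },
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "string"},
                "content": {"type": "string", "analyzer": "html_content"},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))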

Regards,
Lukas


Regarding feeding the HTML within the JSON itself: I'm not sure how you
generate the JSON, but most "to_json" utilities/libraries also escape the
relevant characters.
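For example, a tiny sketch with Python's json module (field names made up):
the quotes and the newline inside the HTML get escaped, and the result
parses back without complaint:

import json

html = '<p class="lead">She said "hello" &amp; left.\n<b>Next line.</b></p>'
doc = {"title": "Example page", "content": html}

body = json.dumps(doc)   # escapes the inner quotes and the newline
print(body)

assert json.loads(body)["content"] == html   # round-trips cleanly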


Ok!

I'm using node.js, which as you know runs on V8, and JSON.parse(str)
actually throws an error (on some hits) when trying to parse the JSON back.
But of course I managed to put it into ES as JSON in the first place, using
a Python script with the json package, and that escaped everything fine.

Maybe it's a bug in node.js..

--Albin

On Wed, Oct 27, 2010 at 12:10 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Regarding feeding the HTML within the JSON itself: I'm not sure how you
generate the JSON, but most "to_json" utilities/libraries also escape the
relevant characters.


On Wed, 2010-10-27 at 10:05 +0200, Albin Stigo wrote:

Ok!

I'm using node.js, which as you know runs on V8, and JSON.parse(str)
actually throws an error (on some hits) when trying to parse the JSON back.
But of course I managed to put it into ES as JSON in the first place, using
a Python script with the json package, and that escaped everything fine.

Request the doc that is throwing the error in node.js using curl from
the command line. Have a look at what is in the _source field and try to
decode it from JSON using both Python and node.js.
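For the Python side, something along these lines (just a sketch; the host,
index, type, id and field names are placeholders for wherever the problem
doc lives):

import json
import urllib.request

# fetch the offending document straight from ES
url = "http://localhost:9200/pages/page/THE_BAD_ID"
raw = urllib.request.urlopen(url).read().decode("utf-8")

hit = json.loads(raw)                    # does Python accept the response?
print(hit["_source"]["content"][:200])   # eyeball the stored HTML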

It's more likely that there was a bug putting the _source INTO ES than
the other way around.

clint