Indexing of HTML content

James_Cook · August 5, 2010, 5:50pm

I see that the attachments plugin uses Tika under the hood to intelligently
index text content embedded in other formats.

We have a situation where we are storing a field in our JSON object that can
contain HTML content. (Think the content of a discussion thread.)

Is there a strategy or technique to make sure the HTML tags are not indexed?
An existing mapping type?

Lukas_Vlcek1 · August 5, 2010, 6:27pm

Hi,

AFAIK Tika uses TagSoup for parsing HTML documents, so you can either use
TagSoup directly or try Tika's Swing GUI to test the output of your
documents first.
See Apache Tika – Getting Started with Apache Tika, where under "Using Tika as a
command line utility" you can find the option -g or --gui.

But if you have a JSON document that contains HTML inside (as the value of
some property), then I think you will have to do this manually. The
attachment plugin assumes that the whole input document is of the same
content type; there is no support for parsing documents with nested
content types.
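To make "do this manually" concrete, here is a minimal client-side sketch in Python. It is only an illustration, not something from the attachment plugin or Tika: the function names and the `body` / `body_text` field names are hypothetical, and it uses the standard library's `html.parser` rather than TagSoup. The idea is to pull the text out of the one HTML-valued property before the JSON document is sent for indexing.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; every tag is replaced by a single space."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        self._parts.append(" ")

    def handle_endtag(self, tag):
        self._parts.append(" ")

    def handle_data(self, data):
        self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace and trim both ends.
        return " ".join("".join(self._parts).split())


def html_to_text(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


def prepare_for_indexing(doc, html_field, text_field):
    """Copy `doc` and add a plain-text version of one HTML-valued field."""
    out = dict(doc)
    out[text_field] = html_to_text(doc[html_field])
    return out


if __name__ == "__main__":
    post = {"id": 1, "body": "<p>Hello <b>world</b> &amp; co.</p>"}
    print(json.dumps(prepare_for_indexing(post, "body", "body_text")))
```

Note that this keeps the text inside `<script>` and `<style>` elements, which is probably fine for user-written fragments but not for whole pages.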

Regards,
Lukas

James_Cook · August 6, 2010, 7:38pm

It would be cool to add tagsoup as a mapping type. Would that be the most
appropriate place for this functionality to reside?

-- jim

Lukas_Vlcek1 · August 6, 2010, 7:47pm

Wouldn't it be easier to do it on the client side? I think doing it on the
server side can be too complex: if you add HTML as a mapping type, then you
have to specify and handle a lot of other things, because every HTML
document has some structure. It has a head and a body; the head usually
contains a lot of metadata, and the body contains paragraphs, divs,
headlines, links, etc.

Can you give an example of how you would like to use it?

Regards,
Lukas

Clinton_Gormley · August 8, 2010, 3:00pm

I think that this is an important use case, and I've opened this issue:

HTML tokenizer · Issue #301 · elastic/elasticsearch · GitHub
(opened by clintongormley 02:59PM, 08 Aug 10 UTC; closed 09:16PM, 21 Sep 10 UTC)

The need to index fields containing HTML is a frequent use case. I think that it is important to have a built-in tokenizer which can handle HTML (ie remove tags and decode entities). At the moment, in my client code I do the following:

```
$value =~ s/<[^>]+>/ /g;   # replace any <.....> extents with a single space
$value =~ s/\s+/ /g;       # replace multiple spaces with a single space
$value =~ s/^ //;          # trim leading whitespace
$value =~ s/ $//;          # trim trailing whitespace
decode_entities($value);   # translate all HTML entities to the equiv UTF-8 char
```

This is sufficient to convert HTML to text suitable for indexing by the default analyzer - doesn't need to do any more than this. Any chance of getting this built in?

clint

kimchy · August 8, 2010, 9:23pm

Agreed. In any case, I am leaning towards more native "attachment"-like
support for specific types, without using Tika. Planning to tackle this
after 0.9.1.

James_Cook · August 12, 2010, 12:16pm

Thanks for opening the feature request. We could implement this on our
client side, but it would cause a large increase in our storage needs to
store the HTML and also provide a field that can be tokenized containing the
raw text content from that same HTML field.

Is it feasible for someone with little knowledge of ES internals to add a
built-in tokenizer like this? (Not sure if tokenizer is the right
terminology. Is it a mapping or an analyzer?)

kimchy · August 12, 2010, 12:32pm

Do you just want to strip out the HTML tags, or do you also want the HTML
parsing to automatically add properties such as title, tags, and so on (on
top of the default body-level text)?

-shay.banon

Lukas_Vlcek1 · August 12, 2010, 12:39pm

Hi,

just a note: if the latter is required then I think this can get more
complex, especially when you realize that HTML5 is adding a lot of new (and
useful) stuff: http://diveintohtml5.org/semantics.html#new-elements

Lukas

James_Cook · August 12, 2010, 12:41pm

I was hoping for just the stripping of tags. I'm indexing HTML fragments
that users create in a CMS, so they are editing with something like
TinyMCE.

kimchy · August 12, 2010, 3:23pm

Done: Analysis: Add `char_filter` on top of `tokenizer`, `filter`, and `analyzer`. Add an `html_strip` char filter and `standard_html_strip` analyzer · Issue #315 · elastic/elasticsearch · GitHub.

Note that there is a new "class" of filters, called char_filter. You will
need to create a custom analyzer with the appropriate tokenizer and
filters, and add html_strip as a char filter. The standard analyzer, for
example, is composed of the standard tokenizer, standard filter, lowercase
filter, and stop filter.
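As a rough illustration of such a custom analyzer, the Python sketch below builds the index settings as a dictionary and prints them as JSON. Treat it as an assumption-laden example: the analyzer name `html_text` and the helper function are made up, and the settings layout is the one used by current Elasticsearch releases, which may differ from the release this thread is about. The "standard" token filter mentioned above is left out because recent versions no longer ship it.

```python
import json


def html_strip_analyzer_settings(name="html_text"):
    """Build index settings for a custom analyzer that strips HTML first.

    `name` is a hypothetical analyzer name; pick whatever suits your index.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {
                        "type": "custom",
                        # Char filters run on the raw text before tokenizing.
                        "char_filter": ["html_strip"],
                        "tokenizer": "standard",
                        # Token filters run on the tokens afterwards.
                        "filter": ["lowercase", "stop"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON would be the body of an index-creation request.
    print(json.dumps(html_strip_analyzer_settings(), indent=2))
```

The field holding the HTML would then reference this analyzer by name in its mapping, so the stored source keeps the markup while only the stripped text is tokenized.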

-shay.banon

James_Cook · August 13, 2010, 1:10am

Great! Thanks for jumping on this. It looks like you are stripping tags
during streaming so the memory use is very efficient. Very nice.
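For readers wondering what stripping "during streaming" can look like, here is a toy Python sketch. It is not the Elasticsearch code and makes no claim about how html_strip is implemented; the function name is hypothetical, and it ignores entities, comments, and `>` inside attribute values. It only shows why the approach is memory-friendly: the input is consumed chunk by chunk, and the only state carried between chunks is whether we are currently inside a tag.

```python
from typing import Iterable, Iterator


def strip_tags_streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text of each chunk with <...> spans replaced by one space."""
    in_tag = False  # carried across chunk boundaries
    for chunk in chunks:
        out = []
        for ch in chunk:
            if in_tag:
                if ch == ">":
                    in_tag = False
                    out.append(" ")  # a closed tag becomes a single space
            elif ch == "<":
                in_tag = True
            else:
                out.append(ch)
        if out:
            yield "".join(out)


if __name__ == "__main__":
    # The <b> tag is deliberately split across two chunks.
    pieces = ["<p>Hello <b", ">world</b></p>"]
    print(" ".join("".join(strip_tags_streaming(pieces)).split()))
```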

-- jim
