Indexing of HTML content

I see that the attachments plugin uses Tika under the hood to intelligently
index text content embedded in other formats.

We have a situation where we are storing a field in our JSON object that can
contain HTML content. (Think the content of a discussion thread.)

Is there a strategy or technique to make sure the HTML tags are not indexed?
An existing mapping type?

Hi,

AFAIK Tika uses TagSoup for parsing HTML documents, so you can either use
TagSoup directly or try Tika's Swing GUI to test the output of your
documents first.
See Apache Tika – Getting Started with Apache Tika, where under "Using Tika
as a command line utility" you can find the -g or --gui option.

But if you have a JSON document that can contain HTML inside (as the value
of some property), then I think you will have to do this manually. The
attachment plugin assumes that the whole input document has a single
content type; there is no support for parsing documents with nested
content types.

Regards,
Lukas
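A minimal sketch of the manual, client-side approach Lukas describes: extract the plain text from the HTML-bearing field before sending the document off to be indexed. This uses Python's stdlib HTMLParser rather than TagSoup, and the field names (`body_html`, `body_text`) are invented for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; entities like &amp; arrive decoded.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Store both the raw HTML and a tag-free field to index/analyze.
doc = {"body_html": "<p>Hello <b>world</b> &amp; friends</p>"}
doc["body_text"] = strip_html(doc["body_html"])
print(doc["body_text"])  # Hello world & friends
```

The downside, as noted later in this thread, is that the document now carries the text twice (raw HTML plus stripped text).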

On Thu, Aug 5, 2010 at 7:50 PM, James Cook jcook@tracermedia.com wrote:


It would be cool to add TagSoup as a mapping type. Would that be the most
appropriate place for this functionality to reside?

-- jim

On Thu, Aug 5, 2010 at 2:27 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:


Wouldn't it be easier to do it on the client side? I think doing it on the
server side could get too complex: if you add HTML as a type to the
mapping, then you have to specify and handle a lot of other things, because
every HTML document has some structure. It has a head and a body; the head
usually contains a lot of metadata, and the body contains paragraphs, divs,
headlines, links, etc.

Can you give some example of how you would like to use it?

Regards,
Lukas

On Fri, Aug 6, 2010 at 9:38 PM, James Cook jcook@tracermedia.com wrote:



I think that this is an important use case, and I've opened this issue:

HTML tokenizer · Issue #301 · elastic/elasticsearch · GitHub

clint

Agreed. In any case, I am leaning towards more native "attachment"-like
support for specific types, without using Tika. Planning to tackle this
post 0.9.1.

On Sun, Aug 8, 2010 at 6:00 PM, Clinton Gormley clinton@iannounce.co.uk wrote:


Thanks for opening the feature request. We could implement this on our
client side, but it would cause a large increase in our storage needs: we
would have to store the HTML and also provide a tokenizable field
containing the raw text content from that same HTML field.

Is it feasible for someone with little knowledge of ES internals to add a
built-in tokenizer like this? (Not sure if tokenizer is the right
terminology; is it a mapping or an analyzer?)

On Sun, Aug 8, 2010 at 5:23 PM, Shay Banon shay.banon@elasticsearch.com wrote:


Do you just want to strip out the HTML markup, or also, as a result of
parsing the HTML, add properties automatically like title, tags, and so on
(on top of the default body-level text)?

-shay.banon
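To make the distinction Shay is drawing concrete: plain stripping flattens everything into one searchable blob, whereas a parsing approach could route, for example, the contents of the title element into its own property. A hypothetical sketch with Python's stdlib HTMLParser (not how any ES plugin actually does it):

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Pulls the <title> text out separately from the body text."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts, self.body_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Route character data to the field we are currently inside.
        (self.title_parts if self.in_title else self.body_parts).append(data)

html = "<html><head><title>My page</title></head><body><p>Some text.</p></body></html>"
p = PageExtractor()
p.feed(html)
p.close()
print("".join(p.title_parts))         # My page
print("".join(p.body_parts).strip())  # Some text.
```

A tag-stripping answer would index "My pageSome text." as one field; the parsing answer yields separate title and body properties.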

On Thu, Aug 12, 2010 at 3:16 PM, James Cook jcook@tracermedia.com wrote:

Thanks for opening the feature request. We could implement this on our
client side, but it would cause a large increase in our storage needs to
store the HTML and also provide a field that can be tokenized containing the
raw text content from that same HTML field.

Is it feasible for someone with little knowledge of ES internals to add a
built-in tokenizer like this? (Not sure if tokenizer is the right
terminology. Is it a mapping or an analyzer?)

On Sun, Aug 8, 2010 at 5:23 PM, Shay Banon shay.banon@elasticsearch.comwrote:

Agreed. In any case, I am leaning towards more native "attachment" like
support for specific types, without using Tika. Planning to tackle this post
0.9.1.

On Sun, Aug 8, 2010 at 6:00 PM, Clinton Gormley <clinton@iannounce.co.uk

wrote:

On Thu, 2010-08-05 at 13:50 -0400, James Cook wrote:

I see that the attachments plugin uses Tika under the hood to
intelligently index text content embedded in other formats.

We have a situation where we are storing a field in our JSON object
that can contain HTML content. (Think the content of a discussion
thread.)

Is there a strategy of technique to make sure the HTML tags are not
indexed? An existing mapping type?

I think that this is an important use case, and I've opened this issue:

HTML tokenizer · Issue #301 · elastic/elasticsearch · GitHub

clint

Hi,

just a note: if the latter is required, then I think this can get more
complex, especially when you realize that HTML5 is adding a lot of new (and
useful) elements: http://diveintohtml5.org/semantics.html#new-elements

Lukas

On Thu, Aug 12, 2010 at 2:32 PM, Shay Banon shay.banon@elasticsearch.com wrote:


I was hoping for just the stripping of tags. I'm indexing HTML fragments
that a user creates in a CMS, so they are editing with something like
TinyMCE.


Done: Analysis: Add `char_filter` on top of `tokenizer`, `filter`, and `analyzer`. Add an `html_strip` char filter and `standard_html_strip` analyzer · Issue #315 · elastic/elasticsearch · GitHub. Note,
there is a new "class" of filters, called char_filter. You will need to
create a custom analyzer with the appropriate tokenizer and filters, and
add html_strip as a char filter. The standard analyzer, for example, is
composed of the standard tokenizer, the standard filter, the lowercase
filter, and the stop filter.

-shay.banon
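For readers landing here later, index settings along these lines would wire the new char filter into a custom analyzer as Shay describes. This is a sketch: the exact settings syntax varies across Elasticsearch versions, and the analyzer name `my_html_analyzer` is invented.

```json
{
  "analysis": {
    "analyzer": {
      "my_html_analyzer": {
        "type": "custom",
        "char_filter": ["html_strip"],
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "stop"]
      }
    }
  }
}
```

Alternatively, per the issue title above, the same change added a ready-made standard_html_strip analyzer, i.e. the standard analyzer with html_strip applied first.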

On Thu, Aug 12, 2010 at 3:41 PM, James Cook jcook@tracermedia.com wrote:


Great! Thanks for jumping on this. It looks like you are stripping tags
during streaming, so the memory use is very efficient. Very nice.

-- jim

On Thu, Aug 12, 2010 at 11:23 AM, Shay Banon shay.banon@elasticsearch.com wrote:
