Tokenizing HTML

Converting to plain text before tokenizing can be a problem if it loses the
natural delimiters that HTML tags provide. For example:

    <ul><li>New York</li><li>City of Chicago</li></ul>

should not be tokenized as New York City, of, Chicago, but that could happen
if you convert it to New York City of Chicago first.

    Converting tag breaks to \n would probably help, but you have to be
    careful which tags you do it with: <p>? Probably. <b>? Probably not.
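    Something like this naive sketch is the kind of thing I mean (the tag
    list is just illustrative, not complete):

      # Naive sketch: turn block-level tag boundaries into newlines before
      # stripping everything else. The tag list is illustrative only.
      BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 br]

      def html_to_text(html)
        text = html.gsub(%r{</?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>}i, "\n")
        text.gsub(/<[^>]+>/, '') # drop remaining (inline) tags outright
      end

      html_to_text('<ul><li>New York</li><li>City of Chicago</li></ul>')
      # => "\n\nNew York\n\nCity of Chicago\n\n"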

    Is there already an analyzer that does the above or close to it?

    My application is in Ruby -- I'm happy (prefer actually) to write my custom
    HTML parser in Ruby with Nokogiri. If I do that, what approaches should I
    consider when I connect to ES?

    Thanks!

    You can use the html_strip character filter. This way you can preserve
    the HTML and at the same time make only the stripped text searchable.
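    For example, something like this (a rough sketch using the
    elasticsearch-ruby client against a recent cluster; the index, analyzer,
    and field names are just placeholders):

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      # Sketch: a custom analyzer whose char_filter strips HTML before the
      # standard tokenizer ever sees the text.
      client.indices.create(
        index: 'docs',
        body: {
          settings: {
            analysis: {
              analyzer: {
                html_text: {
                  type:        'custom',
                  tokenizer:   'standard',
                  char_filter: ['html_strip']
                }
              }
            }
          },
          mappings: {
            properties: {
              body: { type: 'text', analyzer: 'html_text' }
            }
          }
        }
      )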

    Thanks
    Vineeth


    A filter can only run after a tokenizer. Right? (This is what reading
    http://www.manning.com/hatcher3/ and
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/ leads
    me to believe.) So, if the HTMLStripCharFilter just removes tags, it cannot
    possibly adjust the tokenization. Right? Therefore, it could not possibly
    handle the <ul><li>New York</li><li>City of Chicago</li></ul> situation I
    described above. Right?

    Only stripping tags -- without thinking about tokenization as well -- is
    naive in my opinion.

    I'm new to Elasticsearch and Lucene, but I've read quite a bit. I would be
    surprised if the situation I described above is not handled properly...
    somewhere.

    P.S. I just did a code skim of HTMLStripCharFilter and it scared the crap
    out of me in the "I didn't know Java could be this bad" kind of way. Then I
    realized it was generated by JFlex.

    On Mon, 2012-08-06 at 11:55 -0700, David James wrote:

    A filter can only run after a tokenizer.

    A token filter runs after a tokenizer. But it can transform the tokens
    which are finally indexed.

    That said, this is a character filter, not a token filter, and it runs
    before the tokenizer.
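    You can see the ordering with the _analyze API, e.g. (a quick sketch
    with the Ruby client; curl against _analyze shows the same thing):

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      # The char_filter runs first, so the tokenizer never sees the tags.
      resp = client.indices.analyze(
        body: {
          char_filter: ['html_strip'],
          tokenizer:   'standard',
          text:        '<ul><li>New York</li><li>City of Chicago</li></ul>'
        }
      )
      puts resp['tokens'].map { |t| t['token'] }.inspect
      # prints the tokens left after the tags are stripped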

    clint


    Thanks for clearing that up!

    I am currently using Nokogiri (a Ruby library) to parse the HTML and remove
    the tags "intelligently". I wrote some custom logic that:

    • liberally adds newlines ("\n") in the resulting plain text where
      appropriate, so that <ul><li>New York</li><li>City of Chicago</li></ul>
      becomes New York\n\nCity of Chicago\n\n
    • gloms together "inline" type tags, so that <p>I want to go <b>fast</b></p>
      becomes I want to go fast\n\n

    Downstream (in Elasticsearch), I can pay attention to the double newline to
    help tokenization.
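    For instance, one idea I'm considering (just a sketch): split on the
    double newlines and index each block as its own array element, since
    Elasticsearch leaves a position gap between array entries
    (position_increment_gap, 100 by default), which keeps phrase matches
    from spanning blocks:

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      plain  = "New York\n\nCity of Chicago\n\n"
      blocks = plain.split(/\n{2,}/).map(&:strip).reject(&:empty?)
      # => ["New York", "City of Chicago"]

      # With the gap, a phrase query for "New York City" cannot match
      # across the boundary between the two blocks.
      client.index(index: 'docs', body: { body: blocks })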

    The Nokogiri logic does this by paying attention to three kinds of HTML
    elements (rough sketch after the list):

    1. elements that separate tokens (e.g. article aside blockquote body
      button canvas caption center col colgroup dd dir div dl dt embed fieldset
      figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr iframe
      label legend li menu nav noscript ol optgroup option output p pre progress
      samp section select table tbody td textarea tfoot th thead title tr ul video
      )
    2. elements that do not separate tokens (e.g. a abbr acronym address b
      bdo big br cite code dfn em font i input ins kbd nobr q script small span
      strong style time tt u var wbr)
    3. elements that should be skipped (e.g. applet area base basefont del
      img link map meta object param s strike sub sup) (Note that this ignores
      strikethrough text.)
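    Here is a rough sketch of the traversal (with trimmed tag lists; the full
    ones are above):

      require 'nokogiri'

      SEPARATE = %w[p div li ul ol h1 h2 h3 blockquote pre table tr td] # trimmed
      SKIP     = %w[script style img meta link base applet object]     # trimmed

      def extract(node, out)
        if node.text?
          out << node.text
        elsif node.element?
          name = node.name.downcase
          return if SKIP.include?(name)
          block = SEPARATE.include?(name)
          out << "\n\n" if block
          # Anything neither skipped nor separating (i.e. inline tags)
          # just gloms its text together with its neighbors.
          node.children.each { |child| extract(child, out) }
          out << "\n\n" if block
        else
          # Document/fragment nodes: just recurse.
          node.children.each { |child| extract(child, out) }
        end
      end

      def html_to_plain(html)
        out = +''
        extract(Nokogiri::HTML.fragment(html), out)
        out.gsub(/\n{2,}/, "\n\n").gsub(/[ \t]+/, ' ').strip
      end

      html_to_plain('<ul><li>New York</li><li>City of Chicago</li></ul>')
      # => "New York\n\nCity of Chicago"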

    These lists roughly correspond to block-level and inline elements. That
    said, I tweaked them based on the documents I'm seeing, so the lists are
    subjective. They also do not account for changes effected by CSS -- such as
    restyling a div as an inline element (that is so evil!).

    This is not fancy, but I wanted to share in case it is helpful and/or crazy.

    -David