Converting to plain text before tokenizing can be a problem if it loses the
natural delimiters that HTML tags provide. For example:
New
York
City of Chicago
should not be tokenized as New York City,
of, Chicago, but that could happen if you convert it to New York City of
Chicago first.
Converting tag breaks to \n would probably help, but you have to be
careful which tags you do it for: block-level tags? Probably. Inline
tags? Probably not.
Is there already an analyzer that does the above or close to it?
My application is in Ruby -- I'm happy (prefer actually) to write my custom
HTML parser in Ruby with Nokogiri. If I do that, what approaches should I
consider when I connect to ES?
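For what it's worth, Elasticsearch does ship a built-in html_strip char filter (backed by the HTMLStripCharFilter) that can be wired into a custom analyzer; whether it preserves enough separation for the cases above is worth testing. A sketch of the index settings as a Ruby hash (the index and analyzer names are made up, and the commented call assumes the elasticsearch-ruby client):

```ruby
# Index settings wiring Elasticsearch's built-in html_strip char filter
# into a custom analyzer. Analyzer name "html_text" is made up.
settings = {
  settings: {
    analysis: {
      analyzer: {
        html_text: {
          type: "custom",
          char_filter: ["html_strip"],
          tokenizer: "standard",
          filter: ["lowercase"]
        }
      }
    }
  }
}
# then, with an elasticsearch-ruby client:
# client.indices.create(index: "docs", body: settings)
```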
Only stripping tags -- without thinking about tokenization as well -- is
naive in my opinion.
I'm new to Elasticsearch and Lucene, but I've read quite a bit. I would be
surprised if the situation I described above is not handled properly...
somewhere.
P.S. I just did a code skim of HTMLStripCharFilter and it scared the crap
out of me in the "I didn't know Java could be this bad" kind of way. Then I
realized it was generated by JFlex.
I am currently using Nokogiri (a Ruby library) to parse the HTML and remove
the tags "intelligently". I wrote some custom logic that:
liberally adds newlines ("\n") in the resulting plain text where
appropriate, so that
New York
City of Chicago
becomes New York\n\nCity of Chicago\n\n
gloms together "inline" type tags so that
I want to go fast
becomes I want to go fast\n\n
Downstream (in Elasticsearch), I can pay attention to the double newline to
help tokenization.
It does this by paying attention to three kinds of HTML elements:
elements that separate tokens (e.g. article aside blockquote body
button canvas caption center col colgroup dd dir div dl dt embed fieldset
figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr iframe
label legend li menu nav noscript ol optgroup option output p pre progress
samp section select table tbody td textarea tfoot th thead title tr ul video
)
elements that do not separate tokens (e.g. a abbr acronym address b
bdo big br cite code dfn em font i input ins kbd nobr q script small span
strong style time tt u var wbr)
elements that should be skipped (e.g. applet area base basefont del
img link map meta object param s strike sub sup) (Note that this ignores
strikethrough text.)
These lists roughly correspond to block-level and inline elements. That
said, I tweaked them based on documents I'm seeing. So, the lists are
subjective. They do not understand changes effected by CSS -- such as
changing a div to be an inline element (that is so evil!).
This is not fancy, but I wanted to share in case it is helpful and/or crazy.