Tokenizing HTML

Converting to plain text before tokenizing can be a problem if it loses the
natural delimiters that HTML tags provide. For example:

    <ul><li>New York</li><li>City of Chicago</li></ul>

should not be tokenized as New York City, of, Chicago, but that could happen
if you convert it to New York City of Chicago first.

    Converting tag breaks to \n would probably help, but you have to be
    careful which tags you do it with: <p>? Probably. <b>? Probably not.
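    Something like this naive sketch is the kind of thing I mean (the tag
    list is just illustrative, not complete):

      # Naive sketch: turn block-level tag boundaries into newlines before
      # stripping everything else. The tag list is illustrative only.
      BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 br]

      def html_to_text(html)
        text = html.gsub(%r{</?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>}i, "\n")
        text.gsub(/<[^>]+>/, '') # drop remaining (inline) tags outright
      end

      html_to_text('<ul><li>New York</li><li>City of Chicago</li></ul>')
      # => "\n\nNew York\n\nCity of Chicago\n\n"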

    Is there already an analyzer that does the above or close to it?

    My application is in Ruby -- I'm happy (prefer actually) to write my custom
    HTML parser in Ruby with Nokogiri. If I do that, what approaches should I
    consider when I connect to ES?

    Thanks!

    You can use the html_strip character filter. This way you can preserve
    the HTML and at the same time make only the stripped text searchable.
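    For example, something like this (a rough sketch using the
    elasticsearch-ruby client against a recent cluster; the index, analyzer,
    and field names are just placeholders):

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      # Sketch: a custom analyzer whose char_filter strips HTML before the
      # standard tokenizer ever sees the text.
      client.indices.create(
        index: 'docs',
        body: {
          settings: {
            analysis: {
              analyzer: {
                html_text: {
                  type:        'custom',
                  tokenizer:   'standard',
                  char_filter: ['html_strip']
                }
              }
            }
          },
          mappings: {
            properties: {
              body: { type: 'text', analyzer: 'html_text' }
            }
          }
        }
      )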

    Thanks
    Vineeth


    A filter can only run after a tokenizer. Right? (This is what reading
    http://www.manning.com/hatcher3/ and
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/ leads
    me to believe.) So, if the HTMLStripCharFilter just removes tags, it cannot
    possibly adjust the tokenization. Right? Therefore, it could not possibly
    handle the <ul><li>New York</li><li>City of Chicago</li></ul> situation I
    described above. Right?

    Only stripping tags -- without thinking about tokenization as well -- is
    naive in my opinion.

    I'm new to Elasticsearch and Lucene, but I've read quite a bit. I would be
    surprised if the situation I described above is not handled properly...
    somewhere.

    P.S. I just did a code skim of HTMLStripCharFilter and it scared the crap
    out of me in the "I didn't know Java could be this bad" kind of way. Then I
    realized it was generated by JFlex.

    On Mon, 2012-08-06 at 11:55 -0700, David James wrote:

    A filter can only run after a tokenizer.

    A token filter runs after a tokenizer. But it can transform the tokens
    which are finally indexed.

    That said, this is a character filter, not a token filter, and it runs
    before the tokenizer.
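    You can see the ordering with the _analyze API, e.g. (a quick sketch
    with the Ruby client; curl against _analyze shows the same thing):

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      # The char_filter runs first, so the tokenizer never sees the tags.
      resp = client.indices.analyze(
        body: {
          char_filter: ['html_strip'],
          tokenizer:   'standard',
          text:        '<ul><li>New York</li><li>City of Chicago</li></ul>'
        }
      )
      puts resp['tokens'].map { |t| t['token'] }.inspect
      # prints the tokens left after the tags are stripped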

    clint


    Thanks for clearing that up!

    I am currently using Nokogiri (a Ruby library) to parse the HTML and remove
    the tags "intelligently". I wrote some custom logic that:

    • liberally adds newlines ("\n") in the resulting plain text where
      appropriate, so that <ul><li>New York</li><li>City of Chicago</li></ul>
      becomes New York\n\nCity of Chicago\n\n
    • gloms together "inline" type tags, so that <p>I want to go <b>fast</b></p>
      becomes I want to go fast\n\n

    Downstream (in Elasticsearch), I can pay attention to the double newline to
    help tokenization.
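    For instance, one idea I'm considering (just a sketch): split on the
    double newlines and index each block as its own array element, since
    Elasticsearch leaves a position gap between array entries
    (position_increment_gap, 100 by default), which keeps phrase matches
    from spanning blocks:

      require 'elasticsearch'

      client = Elasticsearch::Client.new

      plain  = "New York\n\nCity of Chicago\n\n"
      blocks = plain.split(/\n{2,}/).map(&:strip).reject(&:empty?)
      # => ["New York", "City of Chicago"]

      # With the gap, a phrase query for "New York City" cannot match
      # across the boundary between the two blocks.
      client.index(index: 'docs', body: { body: blocks })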

    The Nokogiri logic does this by paying attention to three kinds of HTML
    elements (rough sketch after the list):

    1. elements that separate tokens (e.g. article aside blockquote body
      button canvas caption center col colgroup dd dir div dl dt embed fieldset
      figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr iframe
      label legend li menu nav noscript ol optgroup option output p pre progress
      samp section select table tbody td textarea tfoot th thead title tr ul video
      )
    2. elements that do not separate tokens (e.g. a abbr acronym address b
      bdo big br cite code dfn em font i input ins kbd nobr q script small span
      strong style time tt u var wbr)
    3. elements that should be skipped (e.g. applet area base basefont del
      img link map meta object param s strike sub sup) (Note that this ignores
      strikethrough text.)
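    Here is a rough sketch of the traversal (with trimmed tag lists; the full
    ones are above):

      require 'nokogiri'

      SEPARATE = %w[p div li ul ol h1 h2 h3 blockquote pre table tr td] # trimmed
      SKIP     = %w[script style img meta link base applet object]     # trimmed

      def extract(node, out)
        if node.text?
          out << node.text
        elsif node.element?
          name = node.name.downcase
          return if SKIP.include?(name)
          block = SEPARATE.include?(name)
          out << "\n\n" if block
          # Anything neither skipped nor separating (i.e. inline tags)
          # just gloms its text together with its neighbors.
          node.children.each { |child| extract(child, out) }
          out << "\n\n" if block
        else
          # Document/fragment nodes: just recurse.
          node.children.each { |child| extract(child, out) }
        end
      end

      def html_to_plain(html)
        out = +''
        extract(Nokogiri::HTML.fragment(html), out)
        out.gsub(/\n{2,}/, "\n\n").gsub(/[ \t]+/, ' ').strip
      end

      html_to_plain('<ul><li>New York</li><li>City of Chicago</li></ul>')
      # => "New York\n\nCity of Chicago"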

    These lists roughly correspond to block-level and inline elements. That
    said, I tweaked them based on the documents I'm seeing, so the lists are
    subjective. They also do not account for changes effected by CSS -- such as
    restyling a div as an inline element (that is so evil!).

    This is not fancy, but I wanted to share in case it is helpful and/or crazy.

    -David