Converting to plain text before tokenizing can be a problem if it loses the
natural delimiters that HTML tags provide. For example:
of, Chicago but that could happen if you convert it to New York City of
Converting tag breaks to \n would probably help, but you have to be
careful about which tags you do it with. For some tags? Probably. For
others? Probably not.
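
To make the idea concrete, here is a rough sketch of what I mean, using only the Ruby standard library. The split into "block" tags (which become \n) and everything else (which is simply dropped) is my own assumption, as are the names BLOCK_TAGS and html_to_delimited_text. It is regex-based, so it is not a real HTML parser: it does not decode entities or handle comments and scripts, and a Nokogiri traversal would be more robust.

```ruby
# Tags treated as natural delimiters. This list is an assumption;
# adjust it to match the markup you actually receive.
BLOCK_TAGS = %w[p div br li ul ol tr td th h1 h2 h3 h4 h5 h6 blockquote].freeze

# Opening, closing, or self-closing form of any block tag.
BLOCK_TAG_RE = %r{</?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>}i
# Any remaining tag (treated as inline).
ANY_TAG_RE = /<[^>]+>/

def html_to_delimited_text(html)
  text = html.gsub(BLOCK_TAG_RE, "\n") # block boundaries become line breaks
  text = text.gsub(ANY_TAG_RE, "")     # inline tags vanish without a break
  text.split("\n")
      .map { |line| line.gsub(/\s+/, " ").strip } # normalize whitespace per line
      .reject(&:empty?)
      .join("\n")
end

puts html_to_delimited_text("<p>New York <b>City</b></p><p>Chicago</p>")
```

With this, the bold tag inside the first paragraph does not split "New York City", while the paragraph boundary keeps "City" and "Chicago" on separate lines.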
Is there already an analyzer that does the above or close to it?
My application is in Ruby -- I'm happy (I'd prefer, actually) to write my custom
HTML parser in Ruby with Nokogiri. If I do that, what approaches should I
consider when I connect to ES?