How can I index HTML tags

I use the standard tokenizer and I don't use the html_strip char filter.
How can I index HTML tags?

In fact, I want to be able to search with and without the < and > characters. I.e. a search for <section> should match This is about the <section> tag, but it should not match In this section we talk about stuff. The standard tokenizer will turn that (search) text into ["section"].

As a bonus, if this can be done I don't have to worry about the stop token filter turning <a> into [].
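To make the problem concrete, here is a rough plain-Python emulation of what the standard tokenizer does to this text (a sketch only, not the real Lucene implementation):

```python
import re

def standard_tokenize(text):
    # Rough emulation: the standard tokenizer splits on non-word characters,
    # so < and > are dropped and never reach the token stream.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(standard_tokenize("This is about the <section> tag"))
# ['this', 'is', 'about', 'the', 'section', 'tag']
print(standard_tokenize("In this section we talk about stuff"))
# ['in', 'this', 'section', 'we', 'talk', 'about', 'stuff']
```

Both sentences produce the same bare section token, so the two cases are indistinguishable at search time.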

I've found one hacky way to get it to work.

keep_html_char_filter = char_filter(
    "keep_html",  # filter name; any name works here
    type="mapping",
    mappings=[
        "<a> => _a_",
        "<i> => _i_",
        "<b> => _b_",
        "<section> => _section_",
    ],
)

This seems to work. The tokens for <a> now become ["_a_"], and for <section> they become ["_section_", "section"], which is probably because of how my analyzer is configured in conjunction with this char filter.
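The effect of this mapping char filter can be emulated in plain Python (a sketch; the extra bare section token in the real output presumably comes from another filter in the actual analyzer chain and is not reproduced here):

```python
import re

MAPPINGS = {"<a>": "_a_", "<i>": "_i_", "<b>": "_b_", "<section>": "_section_"}

def analyze(text):
    # Char filter stage: apply the literal replacements before tokenizing.
    for src, dst in MAPPINGS.items():
        text = text.replace(src, dst)
    # Tokenizer stage: underscores count as word characters,
    # so the mapped form survives as a single token.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(analyze("<a>"))        # ['_a_']
print(analyze("<section>"))  # ['_section_']
```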

Still eager to hear some expert advice for the "proper" way to do this.

I also tried:

keep_html_char_filter = char_filter(
    "keep_html",
    type="html_strip",
    escaped_tags=["a", "b", "section", "i"],
)

But that didn't work: the tokens now become a, section, etc. (instead of being removed), but when they are passed through the token filters, the stop filter removes a.
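Why this fails can be sketched in plain Python (an emulation of the pipeline, with a toy stopword list):

```python
import re

STOPWORDS = {"a", "an", "and", "the", "is", "in", "this"}

def analyze(text):
    # html_strip with escaped_tags leaves <a> in the character stream,
    # but the standard tokenizer still drops the angle brackets...
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    # ...so by the time the stop filter runs, <a> is just "a" and is removed.
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("<a>"))  # [] -- the tag is lost entirely
```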

I figured it out!!

keep_html_char_filter = char_filter(
    "keep_html",
    type="pattern_replace",
    # Reconstructed from the tokens shown below: wrap each bare tag name
    # in "html" so it survives tokenization as a distinct token.
    pattern=r"<(\w+)>",
    replacement="html$1html",
)

Now, if the text is <b> <a> <i> <script> I get the following tokens:

'htmlbhtml', 'htmlahtml', 'htmlihtml', 'htmlscripthtml'

That's great. A search for <section> will match This is about the <section> tag, but it will not match In this section we talk about stuff.
And <a> gets turned into htmlahtml, which means it's no longer a bare a that would be removed as a stopword.
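For the record, the transformation can be reproduced with a plain regex (an emulation, assuming the char filter is a pattern_replace with pattern <(\w+)> and replacement html$1html; adjust if your actual definition differs):

```python
import re

def analyze(text):
    # Emulate the pattern_replace char filter, then the standard tokenizer.
    filtered = re.sub(r"<(\w+)>", r"html\1html", text)
    return re.findall(r"\w+", filtered.lower())

print(analyze("<b> <a> <i> <script>"))
# ['htmlbhtml', 'htmlahtml', 'htmlihtml', 'htmlscripthtml']
```

A search for <section> now analyzes to htmlsectionhtml, which matches the markup example but not the plain word section.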

Would still appreciate an expert's advice. But otherwise I'm happy to close this as resolved.