How can I index HTML tags?

I use the standard tokenizer and I don't use the html_strip char filter.
How can I index HTML tags?

In fact, I want to be able to search with and without the < and > characters. For example, a search for <section> should match "This is about the <section> tag", but it should not match "In this section we talk about stuff". The standard tokenizer turns that search text into ["section"].

As a bonus, if this can be done, I don't have to worry about the stop token filter turning <a> into [].
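
To make the problem concrete, here is a rough sketch in plain Python. It only approximates the analysis chain: `rough_tokens` stands in for the standard tokenizer plus lowercasing (the real tokenizer follows Unicode segmentation rules), and `STOP_WORDS` is a small illustrative subset of English stopwords. Both names are hypothetical helpers, not Elasticsearch APIs.

```python
import re

# Small illustrative subset of English stopwords (assumption, not the full list).
STOP_WORDS = {"a", "an", "the", "in", "is", "this"}


def rough_tokens(text):
    """Approximate the standard tokenizer: keep lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def analyze(text):
    """Tokenize, then drop stopwords, roughly like a stop token filter."""
    return [tok for tok in rough_tokens(text) if tok not in STOP_WORDS]


# The angle brackets are lost, so the tag and the plain word look identical.
print(analyze("<section>"))
print(analyze("In this section we talk about stuff"))

# The tag <a> becomes the stopword "a" and disappears entirely.
print(analyze("<a>"))
```

Under these assumptions, `<section>` and the plain word both yield the token `section`, and `<a>` yields no tokens at all.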

I've found one hacky way to get it to work.

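# Assumes the elasticsearch-dsl Python package, which provides char_filter().
from elasticsearch_dsl import char_filter

# Map each tag of interest to a placeholder that survives tokenization.
keep_html_char_filter = char_filter(
    "keep_html_char_filter",
    type="mapping",
    mappings=[
        "<a> => _a_",
        "<i> => _i_",
        "<b> => _b_",
        "<section> => _section_",
    ],
)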

This seems to work. The token from <a> now becomes ["_a_"], and <section> becomes ["_section_", "section"], which is probably because of how my analyzer works in conjunction with this char filter.

Still eager to hear some expert advice for the "proper" way to do this.

I also tried:

keep_html_char_filter = char_filter(
    "keep_html_char_filter",
    type="html_strip",
    escaped_tags=["a", "b", "section", "i"],
)

But that didn't work. The escaped tags are now kept instead of being stripped, so the tokens become a, section, etc., and once they reach the token filters, the stop filter removes a.

I figured it out!!

keep_html_char_filter = char_filter(
    "keep_html_char_filter",
    type="pattern_replace",
    pattern="<(\\w+)>",
    replacement="html$1html",
)

Now, if the text is <b> <a> <i> <script> I get the following tokens:

'htmlbhtml', 'htmlahtml', 'htmlihtml', 'htmlscripthtml'
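
The effect of this char filter can be checked outside Elasticsearch with a small plain-Python sketch. It assumes Python's `re` module behaves like the Java regex for this simple pattern (the replacement `html$1html` is written `html\1html` in Python), and `keep_html` and `rough_tokens` are hypothetical helper names. The tokenizer here is only a rough approximation of the standard tokenizer.

```python
import re

# Same pattern as the pattern_replace char filter above.
TAG_PATTERN = re.compile(r"<(\w+)>")


def keep_html(text):
    """Rewrite <tag> as htmltaghtml, mimicking the char filter."""
    return TAG_PATTERN.sub(r"html\1html", text)


def rough_tokens(text):
    """Approximate the standard tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


print(rough_tokens(keep_html("<b> <a> <i> <script>")))

# Closing tags and tags with attributes do not match this pattern,
# so they pass through unchanged.
print(keep_html("</section>"))
print(keep_html('<a href="x">'))
```

One thing this sketch makes visible: the pattern only matches bare opening tags, so `</section>` or `<a href="x">` would need a broader pattern if they should be searchable too.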

That's great. A search for <section> will match "This is about the <section> tag", but it will not match "In this section we talk about stuff".
And <a> gets turned into htmlahtml, so it is no longer treated as a lone a, which would be removed as a stopword.
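
For anyone wiring this up without a client-side helper, here is a sketch of what the equivalent index settings might look like, built as a plain Python dict. The analyzer name `html_tag_analyzer` and the `lowercase` and `stop` token filters are assumptions for illustration; the thread does not show the full analyzer.

```python
import json

# Index "settings" body for a custom analyzer using the char filter above.
settings = {
    "analysis": {
        "char_filter": {
            "keep_html_char_filter": {
                "type": "pattern_replace",
                "pattern": "<(\\w+)>",
                "replacement": "html$1html",
            }
        },
        "analyzer": {
            # Hypothetical analyzer name; the filter list is an assumption.
            "html_tag_analyzer": {
                "type": "custom",
                "char_filter": ["keep_html_char_filter"],
                "tokenizer": "standard",
                "filter": ["lowercase", "stop"],
            }
        },
    }
}

# Print the JSON that would go under "settings" when creating the index.
print(json.dumps(settings, indent=2))
```

Char filters run before the tokenizer, which is why the rewritten tag reaches the standard tokenizer as a single word. A field's analyzer is also applied to query text by default, so a search for <section> is rewritten the same way as the indexed text.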

I would still appreciate an expert's advice, but otherwise I'm happy to close this as resolved.
