How can I index HTML tags

I use the standard tokenizer and I don't use the html_strip char filter.
How can I index HTML tags?

In fact, I want to be able to search with and without the < and > characters. I.e. a search for <section> should match This is about the <section> tag, but it should not match In this section we talk about stuff. The standard tokenizer will turn that (search) text into ["section"].

As a bonus, if this can be done I don't have to worry about the stop token filter turning <a> into [].
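To make the problem concrete, here is a rough plain-Python emulation of what the standard tokenizer does to this text (a sketch only, not the real Lucene implementation):

```python
import re

def standard_tokenize(text):
    # Rough emulation: the standard tokenizer splits on non-word characters,
    # so < and > are dropped and never reach the token stream.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(standard_tokenize("This is about the <section> tag"))
# ['this', 'is', 'about', 'the', 'section', 'tag']
print(standard_tokenize("In this section we talk about stuff"))
# ['in', 'this', 'section', 'we', 'talk', 'about', 'stuff']
```

Both sentences produce the same bare section token, so the two cases are indistinguishable at search time.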

I've found one hacky way to get it to work.

keep_html_char_filter = char_filter(
    "keep_html",  # filter name; any name works here
    type="mapping",
    mappings=[
        "<a> => _a_",
        "<i> => _i_",
        "<b> => _b_",
        "<section> => _section_",
    ],
)

This seems to work. The tokens for <a> now become ["_a_"], and for <section> they become ["_section_", "section"], which is probably because of how my analyzer is configured in conjunction with this char filter.
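The effect of this mapping char filter can be emulated in plain Python (a sketch; the extra bare section token in the real output presumably comes from another filter in the actual analyzer chain and is not reproduced here):

```python
import re

MAPPINGS = {"<a>": "_a_", "<i>": "_i_", "<b>": "_b_", "<section>": "_section_"}

def analyze(text):
    # Char filter stage: apply the literal replacements before tokenizing.
    for src, dst in MAPPINGS.items():
        text = text.replace(src, dst)
    # Tokenizer stage: underscores count as word characters,
    # so the mapped form survives as a single token.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(analyze("<a>"))        # ['_a_']
print(analyze("<section>"))  # ['_section_']
```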

Still eager to hear some expert advice for the "proper" way to do this.

I also tried:

keep_html_char_filter = char_filter(
    "keep_html",
    type="html_strip",
    escaped_tags=["a", "b", "section", "i"],
)

But that didn't work: the tokens now become a, section, etc. (instead of being removed), but when they are passed through the token filters, the stop filter removes a.
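Why this fails can be sketched in plain Python (an emulation of the pipeline, with a toy stopword list):

```python
import re

STOPWORDS = {"a", "an", "and", "the", "is", "in", "this"}

def analyze(text):
    # html_strip with escaped_tags leaves <a> in the character stream,
    # but the standard tokenizer still drops the angle brackets...
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    # ...so by the time the stop filter runs, <a> is just "a" and is removed.
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("<a>"))  # [] -- the tag is lost entirely
```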

I figured it out!!

keep_html_char_filter = char_filter(
    "keep_html",
    type="pattern_replace",
    # Reconstructed from the tokens shown below: wrap each bare tag name
    # in "html" so it survives tokenization as a distinct token.
    pattern=r"<(\w+)>",
    replacement="html$1html",
)

Now, if the text is <b> <a> <i> <script> I get the following tokens:

'htmlbhtml', 'htmlahtml', 'htmlihtml', 'htmlscripthtml'

That's great. A search for <section> will match This is about the <section> tag, but it will not match In this section we talk about stuff.
And <a> gets turned into htmlahtml, which means it's no longer a bare a that would be removed as a stopword.
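For the record, the transformation can be reproduced with a plain regex (an emulation, assuming the char filter is a pattern_replace with pattern <(\w+)> and replacement html$1html; adjust if your actual definition differs):

```python
import re

def analyze(text):
    # Emulate the pattern_replace char filter, then the standard tokenizer.
    filtered = re.sub(r"<(\w+)>", r"html\1html", text)
    return re.findall(r"\w+", filtered.lower())

print(analyze("<b> <a> <i> <script>"))
# ['htmlbhtml', 'htmlahtml', 'htmlihtml', 'htmlscripthtml']
```

A search for <section> now analyzes to htmlsectionhtml, which matches the markup example but not the plain word section.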

Would still appreciate an expert's advice. But otherwise I'm happy to close this as resolved.