Extend built-in analyzers

yansal · June 6, 2018, 10:01am

My use case is to add the html_strip char_filter to an existing language analyzer.

For example, I would like to create an index like this one:

PUT /myindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_html_strip": {
                    "type": "english",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "description_english_html": {
                    "type": "text",
                    "analyzer": "english_html_strip"
                }
            }
        }
    }
}

I then expect the following _analyze request to both use the english analyzer and to strip html tags.

POST /myindex/_analyze
{
    "field": "description_english_html",
    "text": "<h1>A header</h1><p>A paragraph</p>"
}

However, it doesn't strip the HTML tags; see the output:

{
    "tokens": [
        {
            "token": "h1",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "header",
            "start_offset": 6,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "h1",
            "start_offset": 14,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "p",
            "start_offset": 18,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "paragraph",
            "start_offset": 22,
            "end_offset": 31,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "p",
            "start_offset": 33,
            "end_offset": 34,
            "type": "<ALPHANUM>",
            "position": 7
        }
    ]
}

Is there a solution, besides rebuilding the language analyzer from scratch?

dadoonet · June 6, 2018, 10:14am

I'd follow this: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer

tdasch · June 6, 2018, 4:02pm

I have no idea if this is correct, but it appears to work. I feel like I've gotten it wrong, but maybe it can help you reach the right answer. The documentation was confusing on adding a character filter to a language analyzer.

> PUT test23
> {
>  "settings": {
>    "analysis": {
>      "analyzer": {
>        "english": {
>          "type": "custom",
>          "char_filter": ["html_strip"],
>          "tokenizer": "standard"
>        }
>      }
>    }
>  } 
> }


> POST test23/_analyze
> {
>   "analyzer": "english",
>   "text": "<h1>A header</h1><p>A paragraph</p>"
> }

> {
>   "tokens": [
>     {
>       "token": "A",
>       "start_offset": 4,
>       "end_offset": 5,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "header",
>       "start_offset": 6,
>       "end_offset": 12,
>       "type": "<ALPHANUM>",
>       "position": 1
>     },
>     {
>       "token": "A",
>       "start_offset": 20,
>       "end_offset": 21,
>       "type": "<ALPHANUM>",
>       "position": 2
>     },
>     {
>       "token": "paragraph",
>       "start_offset": 22,
>       "end_offset": 31,
>       "type": "<ALPHANUM>",
>       "position": 3
>     }
>   ]
> }

yansal · June 7, 2018, 8:54am

It doesn't work. As you can see, it just replaces the "english" analyzer with a plain custom one: "A" appears as a token twice, and it shouldn't, because it's an English article and the english analyzer removes it as a stop word.

tdasch · June 7, 2018, 12:12pm

Yeah, I see that! I slept on it and did some more looking this morning with fresh eyes. Mr. Pilato was on the money, it would seem, but it took me a while to figure out why.

This is my final product, which copies the English analyzer custom example from the doc Mr. Pilato linked - customization will come from your needs.

> PUT test23
> {
>   "settings": {
>     "analysis": {
>       "filter": {
>         "english_stop": {
>           "type":       "stop",
>           "stopwords":  "_english_" 
>         },
>         "english_keywords": {
>           "type":       "keyword_marker",
>           "keywords":   ["example"] 
>         },
>         "english_stemmer": {
>           "type":       "stemmer",
>           "language":   "english"
>         },
>         "english_possessive_stemmer": {
>           "type":       "stemmer",
>           "language":   "possessive_english"
>         }
>       },
>       "analyzer": {
>         "english": {
>           "tokenizer":  "standard",
>           "filter": [
>             "english_possessive_stemmer",
>             "lowercase",
>             "english_stop",
>             "english_keywords",
>             "english_stemmer"
>           ],
>           "char_filter": ["html_strip"]
>         }
>       }
>     }
>   }
> }

My Test:

> POST test23/_analyze
> {
>   "analyzer": "english",
>   "text": "<h1>A header</h1><p>A paragraph</p>"
> }

My Result:

> {
>   "tokens": [
>     {
>       "token": "header",
>       "start_offset": 6,
>       "end_offset": 12,
>       "type": "<ALPHANUM>",
>       "position": 1
>     },
>     {
>       "token": "paragraph",
>       "start_offset": 22,
>       "end_offset": 31,
>       "type": "<ALPHANUM>",
>       "position": 3
>     }
>   ]
> }

I wanted to lay out my thought process in case someone wants to provide further insight or correct my error(s). The "filter" section under settings defines the token filters that the english analyzer is built from (this is the part you would tailor to your needs). The "analyzer" section defines an analyzer named "english" that uses the standard tokenizer, applies the token filters defined above, and adds the html_strip char_filter. I hope this helps.
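
For the use case in the original question, the same rebuilt definition could be registered under its own name rather than reusing the name "english". The following is only a sketch: it assumes the 6.x-style "_doc" mapping from the first post, takes the index, analyzer, and field names from that post, and leaves out the keyword_marker filter on the assumption that no words need to be excluded from stemming.

```console
# Sketch only: names come from the first post; the keyword_marker filter is omitted by assumption.
PUT /myindex
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_html_strip": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "description_english_html": {
                    "type": "text",
                    "analyzer": "english_html_strip"
                }
            }
        }
    }
}
```

With an index created this way, the _analyze request from the first post (using "field": "description_english_html") should run html_strip before the English token filters. This exact request has not been run here, so treat it as a starting point.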

yansal · June 7, 2018, 1:10pm

It works, but only by rebuilding the language analyzer from scratch. The question is whether there is a way to extend an existing analyzer without rebuilding it from scratch.

dadoonet · June 7, 2018, 1:51pm

No, there isn't, apart from the options documented for each analyzer, if any.
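
As an illustration of what those documented options look like: the english analyzer accepts stopwords (or stopwords_path) and stem_exclusion. The request below is a sketch with a hypothetical index name, a hypothetical analyzer name, and made-up word lists. char_filter is not among these options, which is consistent with the output in the first post.

```console
# Hypothetical index name, analyzer name, and word lists, for illustration only.
PUT /myindex_options
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_tuned": {
                    "type": "english",
                    "stopwords": ["a", "an", "the"],
                    "stem_exclusion": ["skies"]
                }
            }
        }
    }
}
```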

dadoonet · June 7, 2018, 2:19pm

I'm not sure I agree with the fact you opened

I mean that the workaround is super easy, as the analyzer is documented, and I don't see much value in implementing this ^^^.

But let's see what the team is saying.

system · July 5, 2018, 2:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
