Extend built-in analyzers


(Yann Salaün) #1

My use case is to add the html_strip char_filter to an existing language analyzer.

For example, I would like to create an index like this one:

PUT /myindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_html_strip": {
                    "type": "english",
                    "char_filter": [
                        "html_strip"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "description_english_html": {
                    "type": "text",
                    "analyzer": "english_html_strip"
                }
            }
        }
    }
}

I then expect the following _analyze request to both use the english analyzer and strip the HTML tags.

POST /myindex/_analyze
{
    "field": "description_english_html",
    "text": "<h1>A header</h1><p>A paragraph</p>"
}

However, it doesn't strip the HTML tags. See the output:

{
    "tokens": [
        {
            "token": "h1",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "header",
            "start_offset": 6,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "h1",
            "start_offset": 14,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "p",
            "start_offset": 18,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "paragraph",
            "start_offset": 22,
            "end_offset": 31,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "p",
            "start_offset": 33,
            "end_offset": 34,
            "type": "<ALPHANUM>",
            "position": 7
        }
    ]
}

Is there a solution, besides rebuilding the language analyzer from scratch?


(David Pilato) #2

I'd follow this: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer


(Thomas Dasch) #3

I have no idea if this is correct, but it appears to work. I feel like I've gotten it wrong, but maybe it can help you reach the right answer. I found the documentation confusing on how to add a character filter to a language analyzer.

> PUT test23
> {
>  "settings": {
>    "analysis": {
>      "analyzer": {
>        "english": {
>          "type": "custom",
>          "char_filter": ["html_strip"],
>          "tokenizer": "standard"
>        }
>      }
>    }
>  } 
> }


> POST test23/_analyze
> {
>   "analyzer": "english",
>   "text": "<h1>A header</h1><p>A paragraph</p>"
> }

> {
>   "tokens": [
>     {
>       "token": "A",
>       "start_offset": 4,
>       "end_offset": 5,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "header",
>       "start_offset": 6,
>       "end_offset": 12,
>       "type": "<ALPHANUM>",
>       "position": 1
>     },
>     {
>       "token": "A",
>       "start_offset": 20,
>       "end_offset": 21,
>       "type": "<ALPHANUM>",
>       "position": 2
>     },
>     {
>       "token": "paragraph",
>       "start_offset": 22,
>       "end_offset": 31,
>       "type": "<ALPHANUM>",
>       "position": 3
>     }
>   ]
> }

(Yann Salaün) #4

It doesn't work. As you can see, it just replaces the "english" analyzer with a custom one that does no English-specific processing. "A" appears as a token twice, and it shouldn't, because it's an English article.


(Thomas Dasch) #5

Yeah, I see that! I slept on it and did some more looking this morning with fresh eyes. Mr. Pilato was on the money, it would seem, but it took me a while to figure out why.

This is my final product, which copies the English analyzer custom example from the doc Mr. Pilato linked - customization will come from your needs.

> PUT test23
> {
>   "settings": {
>     "analysis": {
>       "filter": {
>         "english_stop": {
>           "type":       "stop",
>           "stopwords":  "_english_" 
>         },
>         "english_keywords": {
>           "type":       "keyword_marker",
>           "keywords":   ["example"] 
>         },
>         "english_stemmer": {
>           "type":       "stemmer",
>           "language":   "english"
>         },
>         "english_possessive_stemmer": {
>           "type":       "stemmer",
>           "language":   "possessive_english"
>         }
>       },
>       "analyzer": {
>         "english": {
>           "tokenizer":  "standard",
>           "filter": [
>             "english_possessive_stemmer",
>             "lowercase",
>             "english_stop",
>             "english_keywords",
>             "english_stemmer"
>           ],
>           "char_filter": ["html_strip"]
>         }
>       }
>     }
>   }
> }

My Test:

> POST test23/_analyze
> {
>   "analyzer": "english",
>   "text": "<h1>A header</h1><p>A paragraph</p>"
> }

My Result:

> {
>   "tokens": [
>     {
>       "token": "header",
>       "start_offset": 6,
>       "end_offset": 12,
>       "type": "<ALPHANUM>",
>       "position": 1
>     },
>     {
>       "token": "paragraph",
>       "start_offset": 22,
>       "end_offset": 31,
>       "type": "<ALPHANUM>",
>       "position": 3
>     }
>   ]
> }

I wanted to lay out my thought process in case someone wants to provide further insight or correct my error(s). The "filter" section of the settings defines the token filters that the english analyzer is built from (this is the part you would tailor to your needs). The "analyzer" section defines an analyzer named english, sets the standard tokenizer, applies the filters defined in the "filter" section, and adds the html_strip char_filter. I hope this helps?
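
To tie this back to post #1, here is a sketch of the same idea as a request body for PUT /myindex, with the rebuilt analyzer registered under the name english_html_strip and attached to the description_english_html field. The analyzer name, field name and "_doc" mapping type are taken from post #1 and assume the same Elasticsearch version as that post. I left out the english_keywords filter, which is only needed if there are words to protect from stemming. I have not run this exact request, so treat it as a starting point rather than a verified answer.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_html_strip": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description_english_html": {
          "type": "text",
          "analyzer": "english_html_strip"
        }
      }
    }
  }
}
```

Giving the analyzer its own name, instead of calling it english, keeps it distinct from the built-in english analyzer, so it is clear in the mapping which of the two a field uses.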


(Yann Salaün) #6

It works, but only by rebuilding the language analyzer from scratch. The question is whether there is a way to extend an existing analyzer without rebuilding it from scratch.


(David Pilato) #7

No, there isn't, apart from the options documented for each analyzer, if any.
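
For the english analyzer, the documented options are stopwords (or stopwords_path) and stem_exclusion. char_filter is not one of them, which is consistent with the output in post #1. As a rough illustration of what can be configured without rebuilding, this request body for PUT /myindex changes only those two options. The analyzer name english_custom_stop and the word lists are made up for the example.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom_stop": {
          "type": "english",
          "stopwords": ["a", "an", "the"],
          "stem_exclusion": ["organization"]
        }
      }
    }
  }
}
```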


(David Pilato) #8

I'm not sure I agree with the fact you opened

I mean that the workaround is super easy, as the analyzer is documented, and I don't see much value in implementing this ^^^.

But let's see what the team says.

