cawoodm
(Marc)
March 26, 2020, 8:17pm
1
We want our edge_ngram search to only break words on whitespace.
If we use:
"tokenizer": {
"token_chars": ["letter", "digit", "punctuation", "symbol"],
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
Then it breaks words on characters not considered symbols (like =).
How can we use the edge_ngram tokenizer so that it only tokenizes (word breaks) on whitespace?
Note: "token_chars": ["whitespace"],
produces no tokens at all.
Test case:
"some foo/text with=this.kind?of_thing"
should produce ["some", "foo/text", "with=this.kind?of_thing"]
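The behaviour is easy to reproduce with the _analyze API; a sketch, passing the tokenizer definition above inline:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 20,
    "token_chars": ["letter", "digit", "punctuation", "symbol"]
  },
  "text": "some foo/text with=this.kind?of_thing"
}

The grams stop at the = because it is not matched by any of the listed character classes, so "with" and "this.kind?of_thing" come back as separate tokens.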
Bonus question: which characters are considered symbols? It's not obvious from the source code.
cawoodm
(Marc)
March 27, 2020, 9:05am
2
OK, I found a solution using the rather poorly documented custom_token_chars setting:
"tokenizer": {
"token_chars": ["letter", "digit", "punctuation", "symbol"],
"custom_token_chars": ["="],
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
I can confirm that at least the following characters are already covered by the punctuation and symbol classes above:
/ * - + „ “ * % & ! . , ; : ( ) ° ß ø
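For reference, the edge_ngram tokenizer docs express the same thing by adding the custom character class to token_chars and passing custom_token_chars as a plain string; a sketch with placeholder index and tokenizer names:

PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "punctuation", "symbol", "custom"],
          "custom_token_chars": "="
        }
      }
    }
  }
}

As far as I can tell, custom_token_chars only arrived in Elasticsearch 7.6, which would explain the thin documentation.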
spinscale
(Alexander Reelsen)
March 30, 2020, 2:52pm
3
Dumb question: if you only want to tokenize on whitespace, why not use a whitespace tokenizer? I guess there is some more logic happening on your side?
cawoodm
(Marc)
March 31, 2020, 10:14am
4
Indeed, I was unaware of how to combine filters, analyzers, and tokenizers, but I got some feedback on GitHub which resulted in these settings:
"settings": {
"index": {
"analysis": {
"analyzer": {
"whitesp": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"edgegram": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"edgegram"
]
}
},
"filter": {
"edgegram": {
"min_gram": "2",
"type": "edge_ngram",
"max_gram": "20"
}
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"text_en": {
"search_analyzer": "whitesp",
"analyzer": "edgegram",
"type": "text"
},
"text_de": {
"search_analyzer": "whitesp",
"analyzer": "edgegram",
"type": "text"
}
...
As I understand it, this ensures the index analyzer tokenizes on whitespace and then applies the edge n-gram filter, whilst the search analyzer also tokenizes on whitespace but skips the n-gram step.
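A quick way to verify both analyzers, assuming the index above was created as (the hypothetical) my_index:

POST my_index/_analyze
{
  "analyzer": "edgegram",
  "text": "some foo/text"
}

POST my_index/_analyze
{
  "analyzer": "whitesp",
  "text": "some foo/text"
}

The first call should return the edge n-grams of each whitespace-separated token (so, som, some, fo, foo, foo/, ...); the second just the lowercased whole tokens (some, foo/text).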