Elastic Edge_Ngram with Whitespace Word Breaker

We want our edge_ngram search to only word-break on whitespace.

If we use:

    "tokenizer": {
        "token_chars": ["letter", "digit", "punctuation", "symbol"],
        "min_gram": "2",
        "type": "edge_ngram",
        "max_gram": "20"
    }

Then it still breaks words on characters that are not considered symbols (like =).
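
A quick way to reproduce the unwanted split is the _analyze API with the tokenizer defined inline (a sketch; the sample text is the awkward fragment from the test case below):

    POST _analyze
    {
        "tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 20,
            "token_chars": ["letter", "digit", "punctuation", "symbol"]
        },
        "text": "with=this"
    }

Because = is apparently not matched by any of the listed character classes, the response contains edge n-grams of "with" and "this" separately rather than n-grams of the whole string.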

How can we use the edge_ngram tokenizer so that it only tokenizes (word-breaks) on whitespace?

Note: "token_chars": ["whitespace"], produces no tokens at all.

Test case:
"some foo/text with=this.kind?of_thing" should produce ["some", "foo/text", "with=this.kind?of_thing"]

Bonus question: which characters are considered symbols? It's not obvious from the source code.

OK, found a solution using the rather poorly documented custom_token_chars:

"tokenizer": {
        "token_chars": ["letter", "digit", "punctuation", "symbol"],
        "custom_token_chars": ["="],
        "min_gram": "2",
        "type": "edge_ngram",
        "max_gram": "20"
    }
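
For what it's worth, the reference documentation pairs custom_token_chars with a "custom" entry in the token_chars list and gives its value as a string, so a documented-form sketch of verifying the fix via _analyze would be:

    POST _analyze
    {
        "tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 20,
            "token_chars": ["letter", "digit", "punctuation", "symbol", "custom"],
            "custom_token_chars": "="
        },
        "text": "with=this"
    }

If = is accepted as a token character, the response should contain "wi", "wit", and so on up to the full "with=this", instead of breaking at the =.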

I can confirm that at least the following characters are already covered by punctuation and symbol above:
/ * - + „ “ * % & ! . , ; : ( ) ° ß ø
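
As for the bonus question: rather than reading the source, you can probe a character class empirically with _analyze. With only the symbol class enabled and min_gram set to 1, every space-separated candidate character that the class matches comes back as a token (a sketch; the candidates are arbitrary):

    POST _analyze
    {
        "tokenizer": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 1,
            "token_chars": ["symbol"]
        },
        "text": "= + $ % ° < > | ~"
    }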

Dumb question: if you only want to tokenize on whitespace, why not use the whitespace tokenizer? I guess there is some more logic done on your side?

Indeed, I was unaware of how to combine filters, analyzers, and tokenizers, but I got some feedback on GitHub which resulted in these settings:

    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "whitesp": {
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase"
                        ]
                    },
                    "edgegram": {
                    	"tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "edgegram"
                        ]
                    }
                },
                "filter": {
                    "edgegram": {
                        "min_gram": "2",
                        "type": "edge_ngram",
                        "max_gram": "20"
                    }
                }
            }
        }
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "text_en": {
                "search_analyzer": "whitesp",
                "analyzer": "edgegram",
                "type": "text"
            },
            "text_de": {
                "search_analyzer": "whitesp",
                "analyzer": "edgegram",
                "type": "text"
            }
            ...

As I understand it, this ensures the index analyzer does edge n-gram tokenizing on whitespace, while the search analyzer also tokenizes on whitespace but does not do the n-gram step.
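
Once an index is created with these settings, both analyzers can be spot-checked via _analyze (the index name my_index is illustrative):

    POST my_index/_analyze
    {
        "analyzer": "edgegram",
        "text": "some foo/text with=this.kind?of_thing"
    }

The edgegram analyzer should return the lowercased whitespace-delimited terms expanded into their 2-to-20-character edge n-grams; swapping in "analyzer": "whitesp" should return just the three lowercased terms from the test case above.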
