cawoodm (Marc)
March 26, 2020, 8:17pm
#1
We want our edge_ngram search to break words only on whitespace.
If we use:
    "tokenizer": {
        "token_chars": ["letter", "digit", "punctuation", "symbol"],
        "min_gram": "2",
        "type": "edge_ngram",
        "max_gram": "20"
    }
 
Then it breaks words on characters not considered symbols (like =).
How can we use the edge_ngram tokenizer so that it tokenizes (word-breaks) only on whitespace?
Note: "token_chars": ["whitespace"] produces no tokens at all.
Test case:
"some foo/text with=this.kind?of_thing" should produce ["some", "foo/text", "with=this.kind?of_thing"]
Bonus question: which characters are considered symbols? It's not obvious from the source code.
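To reproduce, something like this _analyze request (with the tokenizer defined inline, which recent 7.x versions support) shows the problem:

    POST _analyze
    {
        "tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 20,
            "token_chars": ["letter", "digit", "punctuation", "symbol"]
        },
        "text": "some foo/text with=this.kind?of_thing"
    }

The last part of the text comes back broken at the = rather than as n-grams of the single token with=this.kind?of_thing.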
cawoodm (Marc)
March 27, 2020, 9:05am
#2
OK, found a solution using the rather poorly documented custom_token_chars (note that "custom" must also be listed in token_chars for it to take effect, and the value is a string of characters):
    "tokenizer": {
        "token_chars": ["letter", "digit", "punctuation", "symbol", "custom"],
        "custom_token_chars": "=",
        "min_gram": "2",
        "type": "edge_ngram",
        "max_gram": "20"
    }
 
I can confirm that at least the following characters are already covered by punctuation and symbol above:
/ * - + „ “ * % & ! . , ; : ( ) ° ß ø
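To verify, an inline _analyze check (a sketch, assuming Elasticsearch 7.6+, where custom_token_chars was introduced) now emits edge n-grams of the whole whitespace-delimited token instead of splitting at =:

    POST _analyze
    {
        "tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 20,
            "token_chars": ["letter", "digit", "punctuation", "symbol", "custom"],
            "custom_token_chars": "="
        },
        "text": "with=this.kind?of_thing"
    }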
spinscale (Alexander Reelsen)
March 30, 2020, 2:52pm
#3
Dumb question: if you only want to tokenize on whitespace, why not use a whitespace tokenizer? I guess there is some more logic done on your side?
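For reference, the built-in whitespace tokenizer already produces exactly the tokens from the test case above:

    POST _analyze
    {
        "tokenizer": "whitespace",
        "text": "some foo/text with=this.kind?of_thing"
    }

which returns ["some", "foo/text", "with=this.kind?of_thing"].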
cawoodm (Marc)
March 31, 2020, 10:14am
#4
Indeed, I was unaware how to combine filters, analyzers, and tokenizers, but I got some feedback on GitHub which resulted in these settings:
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "whitesp": {
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase"
                        ]
                    },
                    "edgegram": {
                    	"tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "edgegram"
                        ]
                    }
                },
                "filter": {
                    "edgegram": {
                        "min_gram": "2",
                        "type": "edge_ngram",
                        "max_gram": "20"
                    }
                }
            }
        }
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "text_en": {
                "search_analyzer": "whitesp",
                "analyzer": "edgegram",
                "type": "text"
            },
            "text_de": {
                "search_analyzer": "whitesp",
                "analyzer": "edgegram",
                "type": "text"
            }
            ...
 
As I understand it, this ensures the indexing analyzer applies edge n-grams to whitespace-delimited tokens, whilst the search analyzer also tokenizes on whitespace but does not do the n-gram step.
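For illustration (my_index is a made-up name here), comparing the two analyzers against the index shows the difference:

    GET my_index/_analyze
    {
        "analyzer": "edgegram",
        "text": "foo/text"
    }

returns the edge n-grams ["fo", "foo", "foo/", "foo/t", "foo/te", "foo/tex", "foo/text"], whereas the same request with "analyzer": "whitesp" returns just ["foo/text"]. So a user typing "foo" matches the indexed n-grams without the query itself being expanded.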
system (system) Closed
April 28, 2020, 10:14am
#5
              This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.