Whitespace tokenizer

anthony_bliecq · July 7, 2020, 1:07pm

Hi,

I'm currently using the standard tokenizer, which, like the whitespace tokenizer, splits on whitespace.
The problem is that I want to split a sentence into tokens on whitespace, but I also want the entire sentence as a single token. How is this possible?

Concrete example:

I want to parse "Hello How Are You" into these tokens --> ["hello","how","are","you","hello how are you"].
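
To make the expected output concrete, here is a minimal Python sketch of the behavior I'm after. It is not Elasticsearch code, only an illustration of the desired token list; the function name `desired_tokens` is made up for this example, and it ignores the char filters, length filter, and stop words from my config.

```python
from typing import List


def desired_tokens(text: str) -> List[str]:
    """Return each lowercased whitespace-separated word,
    followed by the whole lowercased sentence as one extra token."""
    words = text.lower().split()
    # Append the full sentence (words re-joined with single spaces) as a final token.
    return words + [" ".join(words)]


print(desired_tokens("Hello How Are You"))
```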

My current config:

{
  "analysis": {
    "analyzer": {
      "smd_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "char_filter": [
          "html_strip",
          "smd_filter"
        ],
        "filter": [
          "lowercase",
          "asciifolding",
          "smd_length",
          "smd_stop"
        ]
      }
    },
    "char_filter": {
      "smd_filter": {
        "type": "pattern_replace",
        "pattern": "(\\p{L}+)'(\\p{L}+)",
        "replacement": "$0 $1 $2"
      }
    },
    "filter": {
      "smd_length": {
        "type": "length",
        "min": 2
      },
      "smd_stop": {
        "type": "stop",
        "ignore_case": true,
        "stopwords": [ "LE", "LA", "LES", "DU", "DES", "OU", "ET", "SI", "STE", "CIE", "SOC", "GEN", "GIE", "NV", "SA", "SARL", "ST", "BS", "CP", "CV", "DA", "DS", "OAT", "TP", "TSDI", "TSR", "ZZ" ]
      }
    }
  }
}
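
For context, the `smd_filter` char filter is meant to keep an apostrophe word and also emit its two halves. The Python sketch below only approximates that rewrite outside Elasticsearch: Python's `re` module has no `\p{L}`, so `[^\W\d_]` is used as a rough stand-in for "letter", and `\g<0> \1 \2` plays the role of `$0 $1 $2`. The function name `expand_apostrophes` is made up for this example.

```python
import re

# Rough equivalent of (\p{L}+)'(\p{L}+); [^\W\d_] approximates a Unicode letter.
APOSTROPHE_WORD = re.compile(r"([^\W\d_]+)'([^\W\d_]+)")


def expand_apostrophes(text: str) -> str:
    """Replace each letters'letters match with: whole match, left part, right part."""
    return APOSTROPHE_WORD.sub(r"\g<0> \1 \2", text)


print(expand_apostrophes("l'avion"))
```

In the real analyzer, the `smd_length` filter (min 2) would then drop one-letter pieces such as "l".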

system · August 4, 2020, 1:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
