Configuring the standard tokenizer

Hi

We use the "standard" tokenizer in custom analyzer definitions. By default
the standard tokenizer splits words on hyphens and ampersands, so for
example "i-mac" is tokenized to "i" and "mac"

Is there any way to configure the behaviour of the standard tokenizer to
stop it splitting words on hyphens and ampersands, while still doing all
the normal tokenizing it does on other punctuation?

We could define our own "Pattern" tokenizer using a regex, but it seems
unlikely that any custom regex we write would deal with the myriad Unicode
characters the standard tokenizer already handles.

Regards,
Robin

The standard tokenizer follows the Unicode Standard Annex #29 and doesn't
really have any settings besides version and max_token_length. I'm not sure
what your use case is, but one possible solution that comes to mind would be
to replace hyphens and ampersands with characters that don't cause words to
be split, or simply to remove them. For example, you can replace hyphens
with underscores and remove ampersands. Another interesting replacement
character is ".".

You can achieve this using the Mapping Char Filter:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/mapping-charfilter.html

$ curl -XPUT http://localhost:9200/test -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 0,
      "analysis" : {
        "char_filter" : {
          "my_filter" : {
            "type" : "mapping",
            "mappings" : ["\u0027=>", "-=>_"]
          }
        },
        "analyzer" : {
          "my_standard" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "char_filter" : ["my_filter"],
            "filter" : ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'
{"ok":true,"acknowledged":true}
$ curl "localhost:9200/test/_analyze?analyzer=my_standard&pretty=true" -d
"Robin's i-mac is nice."
{
"tokens" : [ {
"token" : "robins",
"start_offset" : 0,
"end_offset" : 7,
"type" : "",
"position" : 1
}, {
"token" : "i_mac",
"start_offset" : 8,
"end_offset" : 13,
"type" : "",
"position" : 2
}, {
"token" : "nice",
"start_offset" : 17,
"end_offset" : 21,
"type" : "",
"position" : 4
} ]
}
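
The question also mentions ampersands. The same char filter can take care of those; for example, to drop them before tokenization (a sketch, extending the mappings above):

"char_filter" : {
  "my_filter" : {
    "type" : "mapping",
    "mappings" : ["\u0027=>", "-=>_", "&=>"]
  }
}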

Thanks Igor for a very elegant solution!

Regards,

Robin

Hi,

What about a simple query string query like this:

{
  "_source":true,
  "query":{
    "simple_query_string":{
      "query":"i-m*",
      "default_operator":"AND"
    }
  }
}

This should match "Robin's i-mac is nice.", but I get no results. Why?

Roeland

By default the simple_query_string query doesn't analyze words with wildcards. As a result it searches for all tokens that start with i-ma. The word i-mac doesn't match this request because during analysis it's split into the two tokens i and mac, and neither of these tokens starts with i-ma. In order to make this query find i-mac you need to make it analyze wildcards:

{
  "_source":true,
  "query":{
    "simple_query_string":{
      "query":"i-ma*",
      "analyze_wildcard": true,
      "default_operator":"AND"
    }
  }
}

Igor

Thanks for answering, Igor, it works.

But I couldn't find this parameter in the documentation of the simple_query_string query?

see Simple Query String Query

PS: I had also asked a question on Stack Overflow, ElasticSearch - Searching with hyphens, which I answered myself by quoting you.

Indeed, this parameter was added in 1.5, but the documentation wasn't updated back then. I fixed that.

Hi

I have to search for order numbers that look like XXXX-XXX. The "-" splits them into separate tokens, so I used a char filter to index them as the single token XXXX_XXX and search for them with something like XXXX_X*.

This is what I have done. Am I missing something? I am not getting the desired result. I am using version 1.4.2.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "myfilter": {
          "type": "mapping",
          "mappings": ["-=>_"]
        }
      },
      "analyzer": {
        "special_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["myfilter"],
          "filter": ["standard", "lowercase", "stop"]
        }
      }
    }
  },
  "type": "jdbc",
  "jdbc": {
    "strategy": "simple",
    "url": "jdbc:datadirect:openedge://10.0.1.169:2032;databaseName=admin11.db",
    "user": "sysprogress",
    "password": "sysprogress",
    "schedule": "0 0-59 0-23 ? * *",
    "sql": "SELECT 'order,' + cast(order.rowid as varchar(255)) as '_id' ,order.ordnr,order.sltl from PUB.order",
    "index": "admin11",
    "type": "order",
    "order": {
      "properties": {
        "ordnr": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}
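
A quick way to check whether the analyzer is actually being applied is to run an order number through it with the _analyze API (a sketch, assuming the index admin11 from the config above was created with these settings):

$ curl "localhost:9200/admin11/_analyze?analyzer=special_analyzer&pretty=true" -d "XXXX-XXX"

If the settings were picked up, this should return a single token xxxx_xxx; if it doesn't, the analysis settings most likely weren't present when the index was created.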