Configuring the standard tokenizer

Hi

We use the "standard" tokenizer in custom analyzer definitions. By default
the standard tokenizer splits words on hyphens and ampersands, so for
example "i-mac" is tokenized to "i" and "mac"

Is there any way to configure the behaviour of the standard tokenizer to
stop it splitting words on hyphens and ampersands, while still doing all
the normal tokenizing it does on other punctuation?

We could define our own "Pattern" tokenizer using a regex, but it seems
unlikely that any custom regex we write would deal with the myriad Unicode
characters the standard tokenizer already handles.

Regards,
Robin

The standard tokenizer follows the Unicode Standard Annex #29 and doesn't
really have any settings besides version and max_token_length. I'm not sure
what your use case is, but one possible solution that comes to mind would be
to replace hyphens and ampersands with characters that don't cause words to
be split, or simply to remove them. For example, you can replace hyphens
with underscores and remove ampersands. Another interesting replacement
character is ".".

You can achieve this using the Mapping Char Filter:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/mapping-charfilter.html

$ curl -XPUT http://localhost:9200/test -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 0,
      "analysis" : {
        "char_filter" : {
          "my_filter" : {
            "type" : "mapping",
            "mappings" : ["\u0027=>", "-=>_"]
          }
        },
        "analyzer" : {
          "my_standard" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "char_filter" : ["my_filter"],
            "filter" : ["standard", "lowercase", "stop"]
          }
        }
      }
    }
  }
}'
{"ok":true,"acknowledged":true}
$ curl "localhost:9200/test/_analyze?analyzer=my_standard&pretty=true" -d
"Robin's i-mac is nice."
{
"tokens" : [ {
"token" : "robins",
"start_offset" : 0,
"end_offset" : 7,
"type" : "",
"position" : 1
}, {
"token" : "i_mac",
"start_offset" : 8,
"end_offset" : 13,
"type" : "",
"position" : 2
}, {
"token" : "nice",
"start_offset" : 17,
"end_offset" : 21,
"type" : "",
"position" : 4
} ]
}
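
The question also mentions ampersands. The same char filter can take care of those; for example, to drop them before tokenization (a sketch, extending the mappings above):

"char_filter" : {
  "my_filter" : {
    "type" : "mapping",
    "mappings" : ["\u0027=>", "-=>_", "&=>"]
  }
}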

Thanks Igor for a very elegant solution!

Regards,

Robin

Hi,

What about a simple query string query like this:

{
  "_source":true,
  "query":{
    "simple_query_string":{
      "query":"i-m*",
      "default_operator":"AND"
    }
  }
}

This should match "Robin's i-mac is nice.", but I get no results. Why?

Roeland

By default the simple_query_string query doesn't analyze words with wildcards. As a result it searches for all tokens that start with i-ma. The word i-mac doesn't match this request because during analysis it's split into the two tokens i and mac, and neither of these tokens starts with i-ma. In order to make this query find i-mac you need to make it analyze wildcards:

{
  "_source":true,
  "query":{
    "simple_query_string":{
      "query":"i-ma*",
      "analyze_wildcard": true,
      "default_operator":"AND"
    }
  }
}

Igor

Thanks for answering, Igor, it works.

But I couldn't find this parameter in the documentation of the simple_query_string query?

see Simple Query String Query

PS: I had also asked a question on Stack Overflow, ElasticSearch - Searching with hyphens, which I answered myself by quoting you.

Indeed, this parameter was added in 1.5, but the documentation wasn't updated back then. I fixed that.

Hi

I have to search for order numbers that look like XXXX-XXX. The "-" splits them into separate tokens, so I used a char filter to index them as the single token XXXX_XXX and search for them with something like XXXX_X*.

This is what I have done. Am I missing something? I am not getting the desired result. I am using version 1.4.2.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "myfilter": {
          "type": "mapping",
          "mappings": ["-=>_"]
        }
      },
      "analyzer": {
        "special_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["myfilter"],
          "filter": ["standard", "lowercase", "stop"]
        }
      }
    }
  },
  "type": "jdbc",
  "jdbc": {
    "strategy": "simple",
    "url": "jdbc:datadirect:openedge://10.0.1.169:2032;databaseName=admin11.db",
    "user": "sysprogress",
    "password": "sysprogress",
    "schedule": "0 0-59 0-23 ? * *",
    "sql": "SELECT 'order,' + cast(order.rowid as varchar(255)) as '_id' ,order.ordnr,order.sltl from PUB.order",
    "index": "admin11",
    "type": "order",
    "order": {
      "properties": {
        "ordnr": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}
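
A quick way to check whether the analyzer is actually being applied is to run an order number through it with the _analyze API (a sketch, assuming the index admin11 from the config above was created with these settings):

$ curl "localhost:9200/admin11/_analyze?analyzer=special_analyzer&pretty=true" -d "XXXX-XXX"

If the settings were picked up, this should return a single token xxxx_xxx; if it doesn't, the analysis settings most likely weren't present when the index was created.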