The original title of this post was "A bug in the shingle token filter's token_separator option, same configuration but different behavior between json and yaml". I changed it to better reflect the nature of the problem, which is a general one and not specific to the shingle token filter.
I found some weird behavior in the shingle token filter. By default, the shingle token filter uses a single space to separate adjacent tokens, and I want it to use no separator at all. I tested the following analyzer:
PUT test_analyzer
{
  "index": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "combiner2"
          ]
        }
      },
      "filter": {
        "combiner2": {
          "max_shingle_size": 2,
          "token_separator": "",
          "type": "shingle"
        }
      }
    }
  }
}
Testing my_analyzer with the string "how is it" returns:
token: how
token: howis
token: is
token: isit
token: it
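For reference, the expected output can be reproduced outside Elasticsearch with a rough sketch of what a shingle filter does. This is not Elasticsearch's implementation: the function `shingles` and its parameters are hypothetical names chosen to mirror the filter options, and the input is assumed to be already split into tokens.

```python
def shingles(tokens, max_shingle_size=2, token_separator=" ", output_unigrams=True):
    """Rough stand-in for a shingle filter (illustrative only).

    For each position it emits the unigram (if requested), then every
    shingle of 2..max_shingle_size tokens starting at that position,
    joined with token_separator.
    """
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for size in range(2, max_shingle_size + 1):
            if i + size <= len(tokens):
                out.append(token_separator.join(tokens[i:i + size]))
    return out


tokens = "how is it".split()
print(shingles(tokens, token_separator=""))   # separator explicitly empty
print(shingles(tokens, token_separator=" "))  # single-space separator
```

With an empty separator the sketch yields how, howis, is, isit, it, which matches the tokens listed above.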
However, when I put combiner2 into elasticsearch.yml and name it huohuo_combiner2, as follows:
huohuo_combiner2:
  max_shingle_size: 2
  output_unigrams: true
  token_separator: ""
  type: shingle
I then changed the filter list in the my_analyzer definition above to use huohuo_combiner2 instead. However, the returned tokens became:
token: how
token: how is
token: is
token: is it
token: it
Note that there is a space between adjacent tokens even though I set token_separator to "". If I explicitly set token_separator to " " (one space) in huohuo_combiner2, the result is the same, with one space in the tokens. If I set token_separator to "  " (two spaces) in huohuo_combiner2, the resulting tokens have two spaces in them.
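Since one space and two spaces are hard to tell apart by eye, a small helper can make the separator explicit when comparing results. This is only a diagnostic sketch: `infer_separator` is a hypothetical name, and it assumes a two-word shingle whose unigrams are known.

```python
def infer_separator(left, right, shingle):
    """Return the text between two unigrams inside a two-word shingle.

    Returns None when the shingle is not `left + separator + right`.
    """
    long_enough = len(shingle) >= len(left) + len(right)
    if long_enough and shingle.startswith(left) and shingle.endswith(right):
        return shingle[len(left):len(shingle) - len(right)]
    return None


# repr() shows the separator with quotes, so whitespace is visible.
for shingle in ["howis", "how is", "how  is"]:
    print(repr(infer_separator("how", "is", shingle)))
```

Running it on the shingles produced by each configuration shows exactly which separator was actually applied.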
I also tried '' (two single quotes with nothing between them) and writing nothing at all after the colon. Both still return tokens with one space in them.
Oh, and my environment is elasticsearch-1.6.0 on Windows 8.