A bug in the elasticsearch.yml configuration file: cannot set an empty string as a value

The original title of this post was "A bug in the shingle token filter's token_separator option, same configuration but different behavior between json and yaml". I changed it to better reflect the nature of the problem, which is a general one and not specific to the shingle token filter.

I found a weird behavior in the shingle token filter. By default, the shingle token filter uses a single space to separate adjacent tokens, but I want no separator at all. I tested the following analyzer:

PUT test_analyzer
{
   "index": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "tokenizer": "standard",
               "filter": [
                  "combiner2"
               ]
            }
         },
         "filter": {
            "combiner2": {
               "max_shingle_size": 2,
               "token_separator": "",
               "type": "shingle"
            }
         }
      }
   }
}

Testing my_analyzer with the string "how is it" returns:
token: how
token: howis
token: is
token: isit
token: it
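
For anyone who wants to reproduce the token list above, here is a minimal sketch of calling the _analyze API from Python. The host and port (localhost:9200) are assumptions, as are the helper names analyze_url and analyze. It also assumes the query-string form of _analyze with the analyzer and text parameters, as in Elasticsearch 1.x.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def analyze_url(index, analyzer, text, base="http://localhost:9200"):
    """Build the _analyze URL. The base address is an assumption."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base}/{index}/_analyze?{query}"


def analyze(index, analyzer, text):
    """Hypothetical helper: return the token strings from a running node."""
    with urlopen(analyze_url(index, analyzer, text)) as resp:
        body = json.load(resp)
    return [t["token"] for t in body["tokens"]]


if __name__ == "__main__":
    # Only prints the URL; call analyze(...) when a node is actually running.
    print(analyze_url("test_analyzer", "my_analyzer", "how is it"))
```

Calling analyze("test_analyzer", "my_analyzer", "how is it") against a node that has the index above should return the token strings, one per entry.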

However, when I put combiner2 into elasticsearch.yml under the name huohuo_combiner2, as follows:

  huohuo_combiner2:
    max_shingle_size: 2
    output_unigrams: true
    token_separator: ""
    type: shingle

I then changed the filter list in the my_analyzer definition above to use huohuo_combiner2 instead. However, the returned tokens became:

token: how
token: how is
token: is
token: is it
token: it

Note that there is a space between adjacent tokens even though I set token_separator to "". If I explicitly set token_separator to " " (one space) in huohuo_combiner2, the result is the same, with one space in the tokens. If I set token_separator to "  " (two spaces) in huohuo_combiner2, the resulting tokens have two spaces in them.
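
To make the expected and observed outputs concrete, here is a small pure-Python sketch of what a shingle filter does with different separators. This is only an illustration of the behavior described above, not Lucene's actual implementation. The function name shingles is made up for this example, and its parameters simply mirror the filter options.

```python
def shingles(tokens, max_shingle_size=2, token_separator=" ", output_unigrams=True):
    """Emit unigrams and shingles in the same order as the token lists in this post."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Shingles of size 2..max_shingle_size that start at position i.
        for size in range(2, max_shingle_size + 1):
            if i + size <= len(tokens):
                out.append(token_separator.join(tokens[i:i + size]))
    return out


tokens = ["how", "is", "it"]
print(shingles(tokens, token_separator=""))   # what I expected from the YAML config
print(shingles(tokens, token_separator=" "))  # what I actually got (the default)
```

The first call prints how, howis, is, isit, it. The second prints how, "how is", is, "is it", it, which matches the output I get when the empty separator in elasticsearch.yml is ignored and the default single space is used.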

I also tried '' (two single quotes with nothing between them) and leaving the value after the colon blank. Both return tokens with one space in them.

Oh, my environment is Elasticsearch 1.6.0 on Windows 8.

I found something. It is not just the shingle token filter: any key in elasticsearch.yml that is set to "" or '' disappears from Elasticsearch's actual configuration.

I used GET /_nodes?os=true&process=true&pretty=true to get the settings of the node. I found that Elasticsearch only picked up this much of my YAML definition:

"huohuo_combiner2": {
    "type": "shingle",
    "max_shingle_size": "2"
},

It removed the token_separator property.
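
A quick way to spot this kind of loss is to compare what was written in elasticsearch.yml with what the node reports. The sketch below uses the two settings blocks from this post as plain dictionaries. The helper name dropped_empty_keys is hypothetical, and the dictionaries are typed in by hand rather than fetched from the node.

```python
def dropped_empty_keys(configured, reported):
    """Return keys that were configured as an empty string but are absent from the node's settings."""
    return sorted(k for k, v in configured.items() if v == "" and k not in reported)


# What I wrote in elasticsearch.yml for huohuo_combiner2.
configured = {
    "max_shingle_size": 2,
    "output_unigrams": True,
    "token_separator": "",
    "type": "shingle",
}

# What GET /_nodes reported for the same filter.
reported = {
    "type": "shingle",
    "max_shingle_size": "2",
}

print(dropped_empty_keys(configured, reported))  # ['token_separator']
```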

I verified this problem on two Windows machines and a Linux machine.