path_hierarchy with custom delimiter fails when a value starts with a number

Hi guys,

I'm facing a weird issue using the path_hierarchy tokenizer with "." as the delimiter. When a part of the path starts with a digit, some queries fail; when every part starts with a letter, everything works fine. In the example below:

  • "NM_052827.a4" works fine (the parts on either side of "." start with "N" and "a")
  • "nm_001798.1" fails (the part after "." starts with the digit "1")
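
For reference, this is how I understand path_hierarchy to split these values: one token per prefix of the path. The shell function below is only a rough emulation for illustration. path_tokens is a made-up helper, not part of Elasticsearch, and it ignores tokenizer options such as replacement, skip and reverse.

```shell
# Rough emulation of path_hierarchy with a custom delimiter.
# path_tokens is a hypothetical helper for illustration only.
path_tokens() {
  text=$1    # value to split, e.g. "nm_001798.1"
  delim=$2   # delimiter, e.g. "." (must be non-empty)
  prefix=""
  rest=$text
  while :; do
    case $rest in
      *"$delim"*)
        # Take everything up to the first delimiter and emit the prefix so far.
        part=${rest%%"$delim"*}
        rest=${rest#*"$delim"}
        prefix="$prefix$part"
        printf '%s\n' "$prefix"
        prefix="$prefix$delim"
        ;;
      *)
        # No delimiter left: emit the full path and stop.
        printf '%s\n' "$prefix$rest"
        break
        ;;
    esac
  done
}

path_tokens "NM_052827.a4" "."
# prints:
#   NM_052827
#   NM_052827.a4
path_tokens "nm_001798.1" "."
# prints:
#   nm_001798
#   nm_001798.1
```

Going by this, both values should yield their first part as a token, so I would expect "nm_001798" to be searchable on any field analyzed with dot_analyzer.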

Here’s the smallest reproduction I’ve been able to narrow it down to:

1. define custom tokenizer in settings
curl -XPUT 'http://localhost:9200/testindex' -d ' { "settings": { "index": { "analysis": { "analyzer": { "dot_analyzer": { "type": "custom", "tokenizer": "dot_tokenizer" } }, "tokenizer": { "dot_tokenizer": { "type": "path_hierarchy", "delimiter": "." } } } } } }'

2. define mapping
curl -XPUT 'http://localhost:9200/testindex/_mapping/onetype' -d ' { "onetype": { "properties": { "root": { "analyzer": "dot_analyzer", "type": "string" }, "nested": { "dynamic": "false", "properties": { "one": { "analyzer": "dot_analyzer", "type": "string" }, "two": { "analyzer": "dot_analyzer", "type": "string" } } } } } } '

3. save doc
curl -XPUT 'http://localhost:9200/testindex/onetype/1017' -d '{ "root" : "root1.value1", "nested": { "one": [ "nested1.one1", "nested2.one2", "nested3.one3" ], "two": [ "NM_001290230.4", "NM_052827.a4", "XM_011537732.4", "nested4.4", "nested4001798.two1", "NM_4001798.two1", "nm_001798.1" ] } }'

4. works OK
curl -XPOST 'http://localhost:9200/testindex/onetype/_search' -d '{ "query": { "query_string": { "query": "nm_4001798", "default_operator": "AND", "auto_generate_phrase_queries": true } } }'

5. but this one fails.
curl -XPOST 'http://localhost:9200/testindex/onetype/_search' -d '{ "query": { "query_string": { "query": "nm_001798", "default_operator": "AND", "auto_generate_phrase_queries": true } } }'

6. though when the field is named explicitly, it works…
curl -XPOST 'http://localhost:9200/testindex/onetype/_search' -d '{ "query": { "query_string": { "query": "nested.two:nm_001798", "default_operator": "AND", "auto_generate_phrase_queries": true } } }'
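
One way to dig into steps 4 to 6 is to compare what each analyzer actually emits, using the _analyze API. The sketch below rests on an assumption I have not confirmed: that a query_string with no field falls back to the _all field, and that _all is analyzed with the standard analyzer rather than dot_analyzer. analyze_body and show_tokens are hypothetical helper names, and ES_URL is an assumed variable that defaults to the local node used above.

```shell
# Assumed base URL of the node (same host and port as the steps above).
ES_URL=${ES_URL:-http://localhost:9200}

# Hypothetical helper: build the JSON body for an _analyze request.
#   $1 = analyzer name, $2 = text to analyze (no escaping is done here)
analyze_body() {
  printf '{ "analyzer": "%s", "text": "%s" }' "$1" "$2"
}

# Hypothetical helper: ask testindex which tokens analyzer $1 emits for text $2.
show_tokens() {
  curl -s -XGET "$ES_URL/testindex/_analyze" -d "$(analyze_body "$1" "$2")"
}

# Usage (needs the running node and the index from step 1):
#   show_tokens dot_analyzer nm_001798.1
#   show_tokens standard     nm_001798.1
#   show_tokens standard     NM_4001798.two1
```

If the standard analyzer keeps "nm_001798.1" as a single token while splitting "NM_4001798.two1" at the dot, that would explain why step 4 matches, step 5 does not, and only the field-qualified query in step 6 finds "nm_001798".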

So, is there something wrong with my settings or queries? Or is this a bug somewhere in or near the path_hierarchy tokenizer?

I'm using ES 2.3.

Thanks
Best,
Sebastien.