I am trying to implement autocomplete for chemical substance names, e.g. 1-Methyl-1,3,4,5-tetrahydropyrrolo(4,3,2-de)isoquinoline
where it should be possible to search using a partial name,
e.g. yrrolo(4,3,2
I've got this working fine with the nGram tokenizer, building nGrams from 1 to 10 characters.
Unfortunately this also needs to work for longer substrings, up to 50+ characters.
Currently I can only find substances using a search term no longer than the nGram max length (which is 10).
I don't want to build nGrams up to 50 or more, as that is time-, space- and resource-consuming, and I believe I can achieve the same with a mix of shorter nGrams and partial matching.
My idea is to use a match_phrase query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) against my nGram index.
Since my nGrams are limited to 10 characters, I need to split the search phrase (using a search analyzer) into chunks of up to 10 characters (the maximum nGram length).
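To make the intended chunking concrete, here is a minimal Python sketch of what I want the search analyzer to produce. The helper name `chunk` is hypothetical and this is plain string slicing, not anything Elasticsearch provides:

```python
def chunk(text: str, size: int = 10) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Example search term taken from the substance name above.
chunks = chunk("yrrolo(4,3,2-de)isoquinoline")
print(chunks)
```

Every chunk except possibly the last has exactly 10 characters, and commas are ordinary characters inside a chunk.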
I couldn't find any off-the-shelf tokenizer that splits a phrase into tokens of a specified length, hence my use of the pattern tokenizer:
curl -XPUT "localhost:9200/my_index" -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": [],
          "filter": []
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{1,10})",
          "group": "1"
        }
      }
    }
  }
}'
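For context, the search request I have in mind would look roughly like the body built below. This is only a sketch: the field name `name` and the helper `build_phrase_query` are assumptions for illustration, and I have not confirmed that the phrase positions line up with the nGram index as intended.

```python
import json


def build_phrase_query(term: str, field: str = "name",
                       analyzer: str = "my_analyzer") -> dict:
    """Build a match_phrase request body that analyzes `term` with `analyzer`.

    `field` is a hypothetical field name; replace it with the real
    nGram-indexed field.
    """
    return {
        "query": {
            "match_phrase": {
                field: {
                    "query": term,
                    "analyzer": analyzer,  # search-time analyzer (the chunking one)
                }
            }
        }
    }


body = build_phrase_query("yrrolo(4,3,2-de)isoquinoline")
print(json.dumps(body, indent=2))
```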
It looks promising, but I can't understand why the results of the _analyze call (in link) are split not only by token length (up to 10 characters, or fewer for the final chunk) but also on the comma character.
Results:
quitelongc
omma
quitelongs
eparated
values
Expected:
quitelong,
comma,quit
elongsepar
ated,value
s
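As a sanity check of the pattern itself, the same regex can be run outside Elasticsearch. The sketch below uses Python's `re` module as a stand-in for the Java regex engine (an approximation, though `.{1,10}` should behave the same for this input), and the input string is reconstructed by joining the expected tokens above:

```python
import re

# Same pattern as in the tokenizer settings.
PATTERN = re.compile(r"(.{1,10})")

expected = ["quitelong,", "comma,quit", "elongsepar", "ated,value", "s"]
text = "".join(expected)  # input reconstructed from the expected tokens

tokens = PATTERN.findall(text)
print(tokens)
```

Here the regex yields exactly the expected chunks, commas included, which suggests the comma splitting happens somewhere other than in the pattern.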
If I use a different character than a comma in the test _analyze call, e.g. a dot, then the results are split correctly by token length and not on the dot separator.
How can I keep the comma from being treated as a stop character by the pattern tokenizer?