nGram ordered partial/phrase matching - how to split search token by length?


(Tomekit) #1

I am trying to implement autocomplete for chemical substance names, e.g. 1-Methyl-1,3,4,5-tetrahydropyrrolo(4,3,2-de)isoquinoline, where it should be possible to search by any part of the name.
E.g. yrrolo(4,3,2

I've got this working fine with the nGram tokenizer, building nGrams of length 1 to 10.
Unfortunately this also needs to work for longer substrings, up to 50+ characters.
Currently I can only find substances with a search term no longer than the maximum nGram length (which is 10).
I don't want to build nGrams up to 50 or more, as that is expensive in time, space and resources, and I believe I can achieve the same with a mix of shorter nGrams and phrase matching.

My idea is to use the match phrase query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html)
against my nGram index.
Since my nGrams are limited to 10 characters, I need to split the search phrase (using a search analyzer) into chunks of up to 10 characters (the maximum nGram length).
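
To make the idea concrete, here is a minimal pure-Python sketch of it. It does not use Elasticsearch and does not model how Elasticsearch assigns token positions; the helper names (`build_index`, `chunks`, `matches`) are made up for illustration. It indexes every nGram of length 1 to 10 together with its start offset, splits the search term into chunks of up to 10 characters, and accepts a match only when the chunks occur back to back.

```python
NAME = "1-Methyl-1,3,4,5-tetrahydropyrrolo(4,3,2-de)isoquinoline"


def build_index(text, min_len=1, max_len=10):
    """Map each nGram to the set of character offsets where it starts."""
    index = {}
    for start in range(len(text)):
        for size in range(min_len, max_len + 1):
            if start + size > len(text):
                break
            index.setdefault(text[start:start + size], set()).add(start)
    return index


def chunks(query, size=10):
    """Split the query into consecutive pieces of at most `size` characters."""
    return [query[i:i + size] for i in range(0, len(query), size)]


def matches(index, query, size=10):
    """True if the query's chunks are indexed at consecutive offsets."""
    parts = chunks(query, size)
    if not parts:
        return False
    for first in index.get(parts[0], ()):
        offset = first
        for part in parts:
            if offset not in index.get(part, ()):
                break
            offset += len(part)
        else:
            return True
    return False


index = build_index(NAME)
print(chunks("tetrahydropyrrolo(4,3,2-de)"))
print(matches(index, "tetrahydropyrrolo(4,3,2-de)"))  # 27 characters, chunks are adjacent
print(matches(index, "tetrahydroisoquinoli"))  # both chunks exist, but not side by side
```

The second query shows why ordering matters: both of its chunks are valid nGrams of the name, so the chunks have to be checked as a phrase rather than as independent terms.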

I couldn't find any off-the-shelf tokenizer that splits a phrase into tokens of a specified length, hence my use of the pattern tokenizer:

curl -XPUT "localhost:9200/my_index" -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": [],
          "filter": []
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{1,10})",
          "group": "1"
        }
      }
    }
  }
}'

Test link: http://localhost:9200/my_index/_analyze?text=quitelongcomma,quitelongseparated,values&analyzer=my_analyzer

It looks promising, but I can't understand why the results from the _analyze call above (see the link) are split not only by token length (up to 10 characters, or fewer for the final remainder) but also at the comma character.
Results:
quitelongc
omma
quitelongs
eparated
values

Expected:
quitelongc
omma,quite
longsepara
ted,values
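
As a sanity check outside Elasticsearch, the same pattern can be run against the exact text from the test link. This is only a sketch: it assumes Python's `re` module handles this simple pattern the same way as the Java regex engine behind the pattern tokenizer, and the variable names are illustrative.

```python
import re

# Text copied from the `text` parameter of the test link.
text = "quitelongcomma,quitelongseparated,values"

# Same pattern as in my_tokenizer; group 1 is the whole match.
tokens = re.findall(r"(.{1,10})", text)

print(tokens)
# ['quitelongc', 'omma,quite', 'longsepara', 'ted,values']
```

The pattern on its own does not treat the comma specially, which suggests the extra split comes from somewhere other than the regular expression.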

If I use a different character instead of the comma in the test text, e.g. a dot, the results are split correctly by token length only, and not at the dot.
How can I stop the comma from being treated as a split character by the pattern tokenizer?

