nGram ordered partial/phrase matching - how to split search token by length?

tomekit · October 15, 2018, 2:57pm

I am trying to implement autocomplete for chemical substance names, e.g. 1-Methyl-1,3,4,5-tetrahydropyrrolo(4,3,2-de)isoquinoline, where it should be possible to search by any part of the name.
E.g. yrrolo(4,3,2

I've got this working fine with the nGram tokenizer, building nGrams of 1 to 10 characters.
Unfortunately, this also needs to work for longer substrings, up to 50+ characters.
Currently I can only find substances when the search term is no longer than the maximum nGram length (which is 10).
I don't want to build nGrams up to 50 or more, as that is costly in indexing time and index size, and I believe I can achieve the same result with a mix of shorter nGrams and phrase matching.

My idea is to use the match phrase query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html)
against my nGram index.
Since my nGrams are limited to 10 characters, I need to split the search phrase (using a search analyzer) into chunks of up to 10 characters (the maximum nGram length).

I couldn't find any off-the-shelf tokenizer that can split a phrase into tokens of a specified length, hence my use of the pattern tokenizer:

curl -XPUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "char_filter": [],
          "filter": []
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(.{1,10})",
          "group": "1"
        }
      }
    }
  }
}'
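
The chunking this pattern is meant to produce can be checked outside Elasticsearch. The following is only a rough sketch: it assumes that `grep -oE` handles this simple pattern the same way as the regex engine used by the pattern tokenizer, and the function name `chunk10` is made up for illustration.

```shell
# Hypothetical helper: split a string into consecutive chunks of at most
# 10 characters, using the same pattern as my_tokenizer (.{1,10}).
chunk10() {
  printf '%s\n' "$1" | grep -oE '.{1,10}'
}

# Fragment of the substance name from above; note that it contains commas.
chunk10 'tetrahydropyrrolo(4,3,2-de)'
# prints:
#   tetrahydro
#   pyrrolo(4,
#   3,2-de)
```

With the regex alone, commas are ordinary characters and end up inside the chunks.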

Test link: http://localhost:9200/my_index/_analyze?text=quitelongcomma,quitelongseparated,values&analyzer=my_analyzer

It looks promising. Unfortunately, I can't understand why the results from the above _analyze call (in the link) are split not only by token length (10 characters, or fewer for the final remainder), but also on the comma character.
Results:
quitelongc
omma
quitelongs
eparated
values

Expected:
quitelongc
omma,quite
longsepara
ted,values

If I use a character other than a comma in the test text, e.g. a dot, the results are split correctly by token length and not on the dot.
How can I prevent the comma from being treated as a separator by the pattern tokenizer?
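
One thing that may be worth ruling out is whether the text is already split before it reaches the tokenizer, since the test link passes it as a URL query parameter. Below is a minimal sketch that sends the same text in a JSON request body instead, using the `analyzer` and `text` fields of the `_analyze` API. The helper names `analyze_body` and `analyze_via_body` are hypothetical, and the body builder assumes the values contain no quotes or backslashes (it does no JSON escaping).

```shell
# Hypothetical helper: build the JSON body for _analyze.
# Assumption: $1 (analyzer name) and $2 (text) need no JSON escaping.
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Hypothetical helper: POST the body to <index>/_analyze on a local node.
# $1 = index name, $2 = analyzer name, $3 = text to analyze.
analyze_via_body() {
  analyze_body "$2" "$3" |
    curl -s -XPOST "localhost:9200/$1/_analyze" \
      -H 'Content-Type: application/json' -d @-
}
```

Usage, against a running node: `analyze_via_body my_index my_analyzer 'quitelongcomma,quitelongseparated,values'`. If the tokens returned this way contain the commas, the split seen earlier would come from how the query parameter is parsed rather than from the pattern tokenizer.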

system · November 12, 2018, 2:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
