Issues with commas after non-word characters && autocomplete

Ran into this with query_string_query:

If the field is "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"

  • note the commas that follow non-letter ([^a-zA-Z]) characters, i.e. after ")" and after "2"

searching: Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA

  • 0 results

searching: Homo sapiens stromal cell derived factor, 4 (SDF4) transcript variant 2 mRNA

  • works just fine (note that the comma after the "r" in "factor" is kept, but the commas after "2" and ")" are dropped)
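For concreteness, the search is a query_string query along these lines (a sketch only: the index name "genes" and the curl form are placeholders; the field comes from the mapping further down):

curl -XPOST 'localhost:9200/genes/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "field_name",
      "query": "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"
    }
  }
}'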

Completely bizarre: if I index the field using the standard tokenizer and no autocomplete filter, it works just fine. This surprises me, because I use the standard tokenizer both at index time and at search time; the only difference is that at index time I also apply a generous edge_ngram filter. A separate, ngram-free search analyzer is the documented approach for precise search on fields indexed with ngram filters.

My initial thought was that the standard tokenizer keeps the commas after non-letter ([^a-zA-Z]) characters while the autocomplete analyzer removes them, but the _analyze API suggests no commas are retained in either case.
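For reference, the token comparison can be done with something like the following (a sketch: the index name "genes" is a placeholder, and the exact _analyze syntax depends on the Elasticsearch version; newer releases expect a JSON body with "analyzer" and "text" fields instead of query-string parameters):

# tokens produced by the search analyzer
curl -XGET 'localhost:9200/genes/_analyze?analyzer=my_standard&pretty' \
  -d 'Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA'

# tokens produced by the index analyzer (edge_ngram applied)
curl -XGET 'localhost:9200/genes/_analyze?analyzer=autocomplete&pretty' \
  -d 'Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA'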

Does anyone know how to solve this without resorting to manually stripping commas (and presumably other punctuation)?

Relevant analysis settings and mapping:

autocomplete_filter:
  type: edge_ngram
  min_gram: 2
  max_gram: 20

autocomplete:
  type: custom
  tokenizer: standard
  filter:
    - standard
    - lowercase
    - autocomplete_filter

my_standard:
  tokenizer: standard
  filter:
    - standard
    - lowercase

The field is indexed as:

field_name:
  type: string
  analyzer: autocomplete
  search_analyzer: my_standard
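Put together, the index creation request looks roughly like this (a sketch: the index name "genes" and type name "gene" are placeholders; the analysis and mapping values are the ones above):

curl -XPUT 'localhost:9200/genes' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "autocomplete_filter"]
        },
        "my_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "field_name": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "my_standard"
        }
      }
    }
  }
}'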
