Ran into this with query_string_query:
If the field is "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"
- note the commas that follow non-letter ([^a-zA-Z]) characters
searching: Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA
- 0 results
searching: Homo sapiens stromal cell derived factor, 4 (SDF4) transcript variant 2 mRNA
- works just fine (note that the comma after "factor" is kept, but the commas after "(SDF4)" and "2" are removed)
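For reference, the failing search was issued with a request along these lines (the index name and the use of default_field are my reconstruction; the field name comes from the mapping below):

```json
GET /my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "field_name",
      "query": "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"
    }
  }
}
```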
Completely bizarre: if I index the field using the standard tokenizer and no autocomplete filter, it works just fine. This surprises me, because I use the standard tokenizer both during indexing and during search; the only difference is that at index time I also apply a generous edge_ngram filter. Using a plain search_analyzer this way is the documented approach for precise matching against fields indexed with ngram filters.
My initial thought was that the standard tokenizer does not remove commas that follow non-letter characters while the autocomplete analyzer does, but the analyze API suggests that no commas are retained in either case.
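The comparison was done via the _analyze API, roughly like this (index name is a placeholder; the analyzer names come from the mapping below, and I ran the same request with "my_standard" to compare):

```json
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"
}
```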
Does anyone know how to solve this without resorting to manually stripping commas (and presumably other punctuation)?
Mapping:
  autocomplete_filter:
    type: edge_ngram
    min_gram: 2
    max_gram: 20
  autocomplete:
    type: custom
    tokenizer: standard
    filter:
      - standard
      - lowercase
      - autocomplete_filter
  my_standard:
    tokenizer: standard
    filter:
      - standard
      - lowercase
Field is indexed as:

  field_name:
    type: string
    analyzer: autocomplete
    search_analyzer: my_standard
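For intuition about what the two analyzers emit, here is a rough Python simulation (this is not Elasticsearch: the regex is a crude stand-in for Lucene's standard tokenizer, and I assume tokens shorter than min_gram pass through unchanged, which may differ from the actual filter's behavior):

```python
import re

def standard_tokenize(text):
    # Crude stand-in for the standard tokenizer + lowercase filter:
    # keep runs of alphanumerics, drop commas, parens, etc.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def edge_ngrams(token, min_gram=2, max_gram=20):
    # Prefixes of length min_gram..max_gram (capped at token length).
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def autocomplete_analyze(text):
    # Index-time analyzer: standard tokenize, then edge_ngram each token.
    out = []
    for tok in standard_tokenize(text):
        # Assumption: short tokens like "4" are kept as-is rather than dropped.
        out.extend(edge_ngrams(tok) or [tok])
    return out
```

In this simulation both analyzers strip the commas identically, which matches what the _analyze API shows, so the simulation does not explain the differing search results either.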