Issues with commas after non-word characters && autocomplete

Ran into this with query_string_query:

If the field is "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"

  • note the commas that follow non-letter ([^a-zA-Z]) characters, i.e. after ")" and after "2"

searching: Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA

  • 0 results

searching: Homo sapiens stromal cell derived factor, 4 (SDF4) transcript variant 2 mRNA

  • works just fine (note that the comma after the "r" in "factor" is kept, but the commas after "2" and ")" are dropped)
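For concreteness, the search is a query_string query along these lines (a sketch only: the index name "genes" and the curl form are placeholders; the field comes from the mapping further down):

curl -XPOST 'localhost:9200/genes/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "field_name",
      "query": "Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA"
    }
  }
}'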

Completely bizarre: if I index the field using the standard tokenizer and no autocomplete filter, it works just fine. This surprises me, because I use the standard tokenizer both at index time and at search time; the only difference is that at index time I also apply a generous edge_ngram filter. A separate, ngram-free search analyzer is the documented approach for precise search on fields indexed with ngram filters.

My initial thought was that the standard tokenizer keeps the commas after non-letter ([^a-zA-Z]) characters while the autocomplete analyzer removes them, but the _analyze API suggests no commas are retained in either case.
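For reference, the token comparison can be done with something like the following (a sketch: the index name "genes" is a placeholder, and the exact _analyze syntax depends on the Elasticsearch version; newer releases expect a JSON body with "analyzer" and "text" fields instead of query-string parameters):

# tokens produced by the search analyzer
curl -XGET 'localhost:9200/genes/_analyze?analyzer=my_standard&pretty' \
  -d 'Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA'

# tokens produced by the index analyzer (edge_ngram applied)
curl -XGET 'localhost:9200/genes/_analyze?analyzer=autocomplete&pretty' \
  -d 'Homo sapiens stromal cell derived factor, 4 (SDF4), transcript variant 2, mRNA'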

Does anyone know how to solve this without resorting to manually stripping commas (and presumably other punctuation)?

Relevant analysis settings and mapping:

autocomplete_filter:
  type: edge_ngram
  min_gram: 2
  max_gram: 20

autocomplete:
  type: custom
  tokenizer: standard
  filter:
    - standard
    - lowercase
    - autocomplete_filter

my_standard:
  tokenizer: standard
  filter:
    - standard
    - lowercase

The field is indexed as:

field_name:
  type: string
  analyzer: autocomplete
  search_analyzer: my_standard
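Put together, the index creation request looks roughly like this (a sketch: the index name "genes" and type name "gene" are placeholders; the analysis and mapping values are the ones above):

curl -XPUT 'localhost:9200/genes' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "autocomplete_filter"]
        },
        "my_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "field_name": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "my_standard"
        }
      }
    }
  }
}'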
