Term negation and fuzziness

cvarano · August 25, 2023, 6:25am

The use case is a search engine over text documents for the general public.

We were previously using a simple match query, but recently switched to simple_query_string in order to easily support phrase matching.

Two things are making this transition difficult:

Fuzziness: simple_query_string has no equivalent of fuzziness: AUTO, and we cannot expect users to manually append ~N to each term in their query. Is it standard practice to modify the query on the server side before sending it to Elasticsearch? Or can this be accomplished more smoothly through character/token filters or some other analyzer? I spent the past couple of hours experimenting with character filters in custom analyzers, but it seems that simple_query_string parses the special operators before analyzing the text. The documentation seems to support that:

This query uses a simple syntax to parse and split the provided query string into terms based on special operators. The query then analyzes each term independently before returning matching documents.
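To make the server-side option concrete, here is a minimal sketch of the kind of rewrite I have in mind. It appends ~N to bare terms using the length thresholds documented for AUTO (0-2 characters: exact, 3-5: one edit, longer: two edits). The function names and the tokenizing regex are my own assumptions for illustration, not anything Elasticsearch provides.

```python
import re

# Hypothetical tokenizer: an optionally prefixed quoted phrase, or a run of
# non-whitespace characters. This is an assumption, not Elasticsearch's parser.
TOKEN_RE = re.compile(r'[+-]*"[^"]*"|\S+')


def auto_edit_distance(term):
    """Mirror the documented AUTO thresholds: 0-2 chars -> 0, 3-5 -> 1, 6+ -> 2."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def add_fuzziness(query):
    """Append ~N to plain word tokens; leave phrases and operator tokens untouched."""
    out = []
    for tok in TOKEN_RE.findall(query):
        # Only tokens made purely of word characters are rewritten, so phrases,
        # negations, and anything already carrying an operator pass through.
        if re.fullmatch(r"\w+", tok):
            distance = auto_edit_distance(tok)
            if distance > 0:
                tok = f"{tok}~{distance}"
        out.append(tok)
    return " ".join(out)


print(add_fuzziness('"attention is all you need" transformers of'))
```

This only emulates AUTO on the client side; whether rewriting user input like this is advisable is exactly what I am asking.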

Negation: the NOT operator (-) has a similar issue because we're using default_operator: OR, so a bare -term is ORed with the rest of the query instead of excluding matches. We can't expect users to write "+-" when they want to exclude terms from their search. What is the best practice?

Alternatively, should I have stuck with the other full-text queries? Without simple_query_string, it seems like supporting phrase matching and other common search operations would involve parsing the query on the server side, then constructing complex compound queries before sending to Elasticsearch.

I'm not opposed to constructing complex queries in order to support operations like phrase matching, term exclusion, fuzziness, etc., but I want to ask what's best practice or recommended. I'd like to avoid reinventing the wheel if Elastic/Lucene provides simpler solutions that I'm overlooking.

Thank you for your help!

cvarano · August 28, 2023, 9:47pm

Concretely, these are the two things I've considered:

Modify user queries before querying with simple_query_string, for example:

"\"attention is all you need\" transformers -\"cross-attention\""
--> "+\"attention is all you need\" transformers +-\"cross-attention\""
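A minimal sketch of that rewrite, assuming a simple regex tokenizer (the function name and regex are hypothetical, and edge cases such as unbalanced quotes are left as-is):

```python
import re

# Hypothetical tokenizer: an optionally negated quoted phrase, or a run of
# non-whitespace characters.
TOKEN_RE = re.compile(r'-?"[^"]*"|\S+')


def require_phrases_and_negations(query):
    """Prefix phrases with + and negated tokens with +, leaving bare terms optional."""
    out = []
    for tok in TOKEN_RE.findall(query):
        is_negation = tok.startswith("-") and len(tok) > 1
        is_phrase = len(tok) >= 2 and tok.startswith('"') and tok.endswith('"')
        if is_negation or is_phrase:
            # "+-x" makes the exclusion mandatory; '+"a b"' makes the phrase mandatory.
            out.append("+" + tok)
        else:
            out.append(tok)
    return " ".join(out)


user_query = '"attention is all you need" transformers -"cross-attention"'
print(require_phrases_and_negations(user_query))
```

The result would then be sent as the query of a simple_query_string request with default_operator: OR.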

Parse user queries and construct compound bool queries:

"\"attention is all you need\" transformers -\"cross-attention\""
--> bool { must: [match_phrase("attention is all you need")], must_not: [match_phrase("cross-attention")], should: [match("transformers")] }
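Sketched in Python, that second option might look like the following. The field name "body", the function name, and the tokenizing regex are assumptions for illustration. Adding fuzziness: AUTO to the match clauses is my own addition here, since the match query does accept that parameter.

```python
import re

# Hypothetical tokenizer: optional leading "-", then a quoted phrase or a bare term.
TOKEN_RE = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


def build_bool_query(user_query, field="body"):
    """Map phrases to must, negated tokens to must_not, and bare terms to should."""
    must, must_not, should = [], [], []
    for m in TOKEN_RE.finditer(user_query):
        negated = m.group(1) == "-"
        phrase, term = m.group(2), m.group(3)
        if phrase is not None:
            clause = {"match_phrase": {field: phrase}}
        else:
            clause = {"match": {field: {"query": term, "fuzziness": "AUTO"}}}
        if negated:
            must_not.append(clause)
        elif phrase is not None:
            must.append(clause)
        else:
            should.append(clause)
    # Only include the occurrence types that actually have clauses.
    bool_body = {}
    for name, clauses in (("must", must), ("must_not", must_not), ("should", should)):
        if clauses:
            bool_body[name] = clauses
    return {"query": {"bool": bool_body}}


query = build_bool_query('"attention is all you need" transformers -"cross-attention"')
print(query)
```

The returned dict is the request body for a search call. Note that with a must clause present, the should clauses only influence scoring unless minimum_should_match is set.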

It feels like this should be a very common problem, but I cannot find much information online at all. What is the recommended solution for this?

system · September 25, 2023, 9:48pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
