Term negation and fuzziness

The use case is a search engine over text documents for the general public.

We were previously using a simple match query, but recently switched to simple_query_string in order to easily support phrase matching.

I'm finding the transition difficult for two reasons:

  1. fuzziness: AUTO is not supported by simple_query_string, and we cannot expect users to manually append ~N to each term in their query. Is it standard practice to rewrite the query on the server side before sending it to Elasticsearch? Or can this be done cleanly through character/token filters or some other analyzer setting? I spent the past couple of hours experimenting with character filters in custom analyzers, but it seems that simple_query_string parses the special operators before it analyzes the text. The documentation seems to confirm that:

This query uses a simple syntax to parse and split the provided query string into terms based on special operators. The query then analyzes each term independently before returning matching documents.

  2. The NOT operator has a similar issue because we're using default_operator: OR. We can't expect users to write "+-" when they want to exclude terms from their search. What is the best practice here?
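To make the server-side rewrite idea concrete, here is a minimal sketch in Python. It assumes the edit distances that fuzziness: AUTO uses by default (0 edits for terms of 1-2 characters, 1 edit for 3-5, 2 edits for longer terms) and the "+-" exclusion form mentioned above. The names auto_edits and rewrite_query are hypothetical, the tokenizer is deliberately naive, and anything that already contains operator characters is left untouched. Treat it as an illustration, not a recommended implementation.

```python
import re

# Naive tokenizer: an optionally negated quoted phrase, or a run of non-whitespace.
TOKEN_RE = re.compile(r'-?"[^"]*"|\S+')
# Only tokens made purely of word characters get a fuzziness suffix.
PLAIN_TERM_RE = re.compile(r'^\w+$')


def auto_edits(term):
    """Edit distance that the default AUTO setting would pick for this term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def rewrite_query(user_query):
    """Rewrite a user query for simple_query_string (hypothetical helper).

    - "-x" becomes "+-x" so the exclusion is ANDed in under default_operator: OR
    - plain terms get "~N" appended, with N chosen by auto_edits
    - quoted phrases and tokens containing other operators are passed through
    """
    out = []
    for token in TOKEN_RE.findall(user_query):
        if token.startswith("-") and len(token) > 1:
            out.append("+" + token)          # negated term or phrase, no fuzziness
        elif token.startswith('"'):
            out.append(token)                # phrase, left as typed
        elif PLAIN_TERM_RE.match(token):
            edits = auto_edits(token)
            out.append(f"{token}~{edits}" if edits else token)
        else:
            out.append(token)                # already has operators; do not touch
    return " ".join(out)


if __name__ == "__main__":
    print(rewrite_query('"attention is all you need" transformers -"cross-attention"'))
    # prints: "attention is all you need" transformers~2 +-"cross-attention"
```

The length check here runs on the raw token, before any analyzer has touched it, so the chosen N can differ from what AUTO would compute on the analyzed term (for example after stemming).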

Alternatively, should I have stuck with the other full-text queries? Without simple_query_string, it seems like supporting phrase matching and other common search operations would involve parsing the query on the server side, then constructing complex compound queries before sending to Elasticsearch.

I'm not opposed to constructing complex queries in order to support operations like phrase matching, term exclusion, fuzziness, etc., but I want to ask what's best practice or recommended. I'd like to avoid reinventing the wheel if Elastic/Lucene provides simpler solutions that I'm overlooking.

Thank you for your help!

Concretely, these are the two things I've considered:

  1. Modify user queries before querying with simple_query_string, for example:
"\"attention is all you need\" transformers -\"cross-attention\""
--> "+\"attention is all you need\" transformers +-\"cross-attention\""
  2. Parse user queries and construct compound bool queries:
"\"attention is all you need\" transformers -\"cross-attention\""
--> bool { must: [match_phrase("attention is all you need")], must_not: [match_phrase("cross-attention")], should: [match("transformers")] }
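For comparison, the second option could look roughly like the sketch below. It follows the same mapping as the example above: quoted phrases go to must, negated tokens go to must_not, and bare terms go to should, where the match query accepts fuzziness: AUTO directly. The function name build_bool_query and the field name "body" are placeholders I made up for illustration, and the tokenizer ignores everything except quotes and a leading "-".

```python
import re

# Naive tokenizer: an optionally negated quoted phrase, or a run of non-whitespace.
TOKEN_RE = re.compile(r'-?"[^"]*"|\S+')


def build_bool_query(user_query, field="body"):
    """Translate a user query into a bool query body (hypothetical helper).

    `field` is a placeholder for whatever text field is being searched.
    """
    must, must_not, should = [], [], []
    for token in TOKEN_RE.findall(user_query):
        negated = token.startswith("-") and len(token) > 1
        text = token[1:] if negated else token
        is_phrase = len(text) >= 2 and text.startswith('"') and text.endswith('"')
        if is_phrase:
            text = text[1:-1]
        if not text:
            continue  # skip empty phrases such as ""

        if is_phrase:
            clause = {"match_phrase": {field: text}}
        elif negated:
            clause = {"match": {field: text}}  # exact exclusion, no fuzziness
        else:
            clause = {"match": {field: {"query": text, "fuzziness": "AUTO"}}}

        if negated:
            must_not.append(clause)
        elif is_phrase:
            must.append(clause)
        else:
            should.append(clause)

    return {"query": {"bool": {"must": must, "must_not": must_not, "should": should}}}


if __name__ == "__main__":
    import json
    body = build_bool_query('"attention is all you need" transformers -"cross-attention"')
    print(json.dumps(body, indent=2))
```

The returned dict is meant to be sent as the search request body. Whether bare terms belong in should or must is a product decision, so that routing is an assumption carried over from the example rather than a recommendation.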

It feels like this should be a very common problem, but I cannot find much information online at all. What is the recommended solution for this?
