I am using Elasticsearch with C# NEST, and my application cleanses and prepares each search term it receives through the API.
A few scenarios:
- Removal of special characters, such as `" \ ' ( ) [ ] { }`
- Escaping certain special characters, such as `* + -`
- Including wildcard operators around each word to match more results, so `Hello Goodbye` becomes `*Hello* *Goodbye*`
- Including fuzziness on big words to fix potential typos, so `Elasticsearch` becomes `(Elasticsearch~ OR Elasticsearch*)`
Putting it all together, `Hello, Elasticsearch! (Vitor-san said)` becomes something like `*Hello* (Elasticsearch~ OR Elasticsearch*) *Vitor\-san* *said*`
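To make the behaviour concrete, here is a simplified sketch of the steps above. It is written in Python for brevity and is not our actual C# code. The function name `prepare_search_term`, the 10-character threshold for a "big" word, and the inclusion of `,` and `!` in the removal set are assumptions, chosen only so that the sketch reproduces the example above.

```python
# Characters removed outright. The first nine come from the list above;
# "," and "!" are an assumption inferred from the worked example.
REMOVED_CHARS = "\"\\'()[]{},!"
# Characters that are escaped with a backslash instead of being removed.
ESCAPED_CHARS = "*+-"
# Hypothetical cut-off: words at least this long get the fuzzy treatment.
FUZZY_MIN_LENGTH = 10

_REMOVE_TABLE = str.maketrans("", "", REMOVED_CHARS)


def _escape(word: str) -> str:
    """Prefix every escapable character with a backslash."""
    return "".join("\\" + ch if ch in ESCAPED_CHARS else ch for ch in word)


def prepare_search_term(raw: str) -> str:
    """Turn a raw user search term into a query_string-style expression."""
    cleaned = raw.translate(_REMOVE_TABLE)
    clauses = []
    for word in cleaned.split():
        escaped = _escape(word)
        if len(word) >= FUZZY_MIN_LENGTH:
            # Big word: fuzzy match OR prefix match.
            clauses.append(f"({escaped}~ OR {escaped}*)")
        else:
            # Short word: wildcard on both sides.
            clauses.append(f"*{escaped}*")
    return " ".join(clauses)


if __name__ == "__main__":
    print(prepare_search_term("Hello, Elasticsearch! (Vitor-san said)"))
```

Even in this reduced form, the order of the steps matters: escaping has to happen before the wildcards are added, otherwise the added `*` would be escaped as well.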
This is a pretty cool implementation that ensures a much higher match rate when searching, but it also puts a heavy load on the server in certain scenarios. It is also quite hard to maintain, because so much manipulation of the search term sometimes produces unexpected output. We have had several cases where a certain escape didn't work, or where a certain expression generated a long run of asterisks.
Does anyone know of a library or a methodology for preparing and cleansing search terms?
We want to manipulate the search term to improve the match rate and to prevent special characters from being misused, but there is a tradeoff between the two and it is hard to find the right balance.
Thanks in advance.