We are aiming to implement a naive search in Elasticsearch that functions exactly like a Ctrl+F search. Specifically, a query like "rd1 wo" should match "word1 word2" because it is a partial match in sequence.
The search must return only documents containing the exact matching sequence and exclude everything else. While this behavior seems straightforward, achieving it in Elasticsearch has proven unexpectedly complex because of how it analyzes and tokenizes text.
Despite the temptation to use simpler methods, the requirement is to implement this directly in Elasticsearch.
Key Requirements:
Support partial matches (sub-word queries).
Handle both single-word queries and multi-word phrases.
Ensure sub-word queries work for sequences of two or more partial words.
Exclude all non-matching results—no false positives allowed.
Currently, we're using a combination of match_phrase and an n-gram filter, followed by post-processing to filter out non-matching results. However, this approach is inefficient and not ideal.
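For reference, this is roughly what our current setup looks like; the index body and query below are reconstructed sketches (field and analyzer names are placeholders, and the real gram sizes may differ):

```python
# Rough sketch of the current approach: an n-gram filter at index time,
# queried with match_phrase, then post-filtered in application code.
# Index/field/analyzer names are placeholders.
ngram_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "partial_ngrams": {"type": "ngram", "min_gram": 2, "max_gram": 3}
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "partial_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "ngram_analyzer",
                "search_analyzer": "standard",
            }
        }
    },
}

# The query itself; results still need post-filtering in application code,
# because n-grams from different words can line up and produce false positives.
ngram_query = {
    "query": {
        "match_phrase": {
            "text": "rd1 wo"
        }
    }
}
```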
Any guidance or solutions to achieve this in Elasticsearch would be greatly appreciated.
What exactly did you try with the field type I pointed to, and what about it did not work? I think it should meet most, if not all, of your stated requirements.
Please show mappings, sample document(s) and queries so we can recreate and get a better understanding of what is and is not working.
Awesome! I just tried it, and it works perfectly!
I hadn’t realized there was such a wildcard type for fields—I thought it was only available as a query type.
Thanks so much!
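For anyone who finds this thread later, here is a minimal sketch of the wildcard setup we ended up with (index and field names are placeholders):

```python
# Minimal sketch: map the field as the wildcard type, then query it with a
# wildcard query. Field name "text" is a placeholder.
wildcard_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "wildcard"}
        }
    }
}

# "rd1 wo" matches "word1 word2": the wildcard query runs a Ctrl+F-style
# substring match over the original field value.
wildcard_query = {
    "query": {
        "wildcard": {
            "text": {
                "value": "*rd1 wo*",
                "case_insensitive": True
            }
        }
    }
}
```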
I need further help with constructing the correct regular expression for a query that meets the following requirements:
a. Exact phrase with word boundaries:
The query should match the exact phrase, ensuring word boundaries are respected.
b. Partial phrase:
The query should also allow searching for partial phrases.
Both types of search should offer an option to control whether or not they are sensitive to punctuation.
The wildcard field is not “sensitive to punctuation” by design; like the Ctrl+F behaviour you described, it searches everything.
If you want to allow for skipping punctuation, that would require a regex that lists the permitted characters between words, e.g. commas, carriage returns, etc. Several of these would need escaping so they do not clash with characters that are reserved in regular expressions, such as the full stop.
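Something along these lines, as a sketch; the set of permitted separator characters below is only an example, and the helper name is made up:

```python
import re

# Sketch of the "list the permitted characters between words" idea.
# The separator class (space, comma, semicolon, escaped full stop, carriage
# return, newline) is only an example; extend or trim it as needed.
SEPARATOR_CLASS = "[ ,;\\.\r\n]*"

def punctuation_tolerant_pattern(phrase):
    # Escape each word so regex metacharacters are treated literally, then
    # allow any run of separator characters between consecutive words.
    return SEPARATOR_CLASS.join(re.escape(word) for word in phrase.split())

# Example: punctuation_tolerant_pattern("word1 word2") can be dropped into the
# "value" of a regexp query, wrapped in .* on both sides for a partial match.
```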
What I have tried:
```python
import re

def construct_regex_query(phrase, exact=True, punctuation_sensitive=True):
    # Escape the input string so regex metacharacters in it are treated literally
    escaped_phrase = re.escape(phrase)

    if punctuation_sensitive:
        if exact:
            # Exact match with punctuation sensitivity: \b word boundaries around
            # the phrase, allowing other text on either side. Raw f-strings are
            # needed so \b reaches Elasticsearch literally rather than becoming a
            # Python backspace character.
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": rf".*\b{escaped_phrase}\b.*"  # word-boundary match
                        }
                    }
                }
            }
        else:
            # Partial match with punctuation sensitivity
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase}.*"  # substring match
                        }
                    }
                }
            }
    else:
        # Strip punctuation from the phrase to make the search punctuation-insensitive
        phrase_without_punctuation = re.sub(r'[^\w\s]', '', phrase)
        escaped_phrase_no_punc = re.escape(phrase_without_punctuation)

        if exact:
            # Word boundaries around the punctuation-stripped phrase
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": rf".*\b{escaped_phrase_no_punc}\b.*"  # word-boundary match, no punctuation
                        }
                    }
                }
            }
        else:
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase_no_punc}.*"  # substring match, no punctuation
                        }
                    }
                }
            }
```
The partial match appears to work as expected, but the exact match still doesn't function at all in this implementation.
I came across a StackOverflow answer (which I can't share due to link restrictions) that mentions \b can't be used in Elasticsearch, but the alternative suggestion also didn't work.
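For reference, the alternative it suggested was, as far as I can reconstruct it, to spell out the word boundary with a negated character class, roughly like the sketch below (word characters assumed to be letters, digits and underscore; `exact_phrase_query` is just an illustrative helper, and the field name "text" is a placeholder):

```python
import re

# The Lucene regexp engine behind Elasticsearch's regexp query has no \b
# operator, so the boundary has to be spelled out with character classes.
NON_WORD = "[^a-zA-Z0-9_]"

def exact_phrase_query(phrase):
    escaped = re.escape(phrase)
    # The phrase must either start the field value or be preceded by a
    # non-word character, and either end the value or be followed by one.
    pattern = f"(.*{NON_WORD})?{escaped}({NON_WORD}.*)?"
    return {"query": {"regexp": {"text": {"value": pattern}}}}
```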