Ctrl+F search behavior in elastic

We are aiming to implement a naive search in Elasticsearch that functions exactly like a Ctrl+F search. Specifically, a query like "rd1 wo" should match "word1 word2" because it is a partial match in sequence.

The search must return only the exact sequence of matching results and exclude everything else. While this behavior seems straightforward, achieving it within Elasticsearch has proven unexpectedly complex due to its design and behavior.

Despite the temptation to use simpler methods, the requirement is to implement this directly in Elasticsearch.

Key Requirements:

  1. Support partial matches (sub-word queries).
  2. Handle both single-word queries and multi-word phrases.
  3. Ensure sub-word queries work for sequences of two or more partial words.
  4. Exclude all non-matching results—no false positives allowed.

Currently, we're using a combination of match_phrase and an n-gram filter, followed by post-processing to filter out non-matching results. However, this approach is inefficient and not ideal.

Any guidance or solutions to achieve this in Elasticsearch would be greatly appreciated.

Thank you in advance,
Arie

Have you looked into using the wildcard field type together with a regexp query?

Yes, we've tried numerous combinations, but none have successfully met all the conditions.

What exactly did you try with the field type I pointed to and what about it did not work? I think it should meet most, if not all, of your stated requirements.

Please show mappings, sample document(s) and queries so we can recreate and get a better understanding of what is and is not working.

Awesome! I just tried it, and it works perfectly!
I hadn’t realized there was such a wildcard type for fields—I thought it was only available as a query type.
Thanks so much!
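
For anyone following along, the setup that worked here can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the field name `text` and the helpers `escape_lucene_regexp` and `substring_query` are hypothetical names chosen for illustration, and the `wildcard` field type requires Elasticsearch 7.9 or later.

```python
# Characters reserved by Lucene's regexp syntax, including the optional
# operators that Elasticsearch enables by default.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')

# Hypothetical index body: one field named "text" mapped as type "wildcard".
INDEX_MAPPING = {"mappings": {"properties": {"text": {"type": "wildcard"}}}}


def escape_lucene_regexp(text: str) -> str:
    """Backslash-escape reserved characters so they are matched literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in text)


def substring_query(phrase: str, field: str = "text") -> dict:
    """Build a regexp query matching values that contain `phrase` verbatim."""
    # Lucene patterns are anchored to the whole value, hence the ".*" on both sides.
    pattern = ".*" + escape_lucene_regexp(phrase) + ".*"
    return {"query": {"regexp": {field: {"value": pattern}}}}
```

With this sketch, `substring_query("rd1 wo")` produces the pattern `.*rd1 wo.*`, which is the Ctrl+F style match against "word1 word2" described in the first post. Both dictionaries are meant to be sent as request bodies with whichever client is in use.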


I need further help with constructing the correct regular expression for a query that meets the following requirements:

a. Exact phrase with word boundaries:
The query should match the exact phrase, ensuring word boundaries are respected.
b. Partial phrase:
The query should also allow searching for partial phrases.

Both types of searches should have the option to decide whether they are sensitive to punctuation or not.

I tried a few approaches but ran into problems.

You will need to provide examples that show what you are trying to do and what you have tried that is not working.

1. Exact Phrase Search with Word Boundaries:

  • Search Phrase: "exact phrase"
  • Matches:
    • "exact phrase" (exact match, with word boundaries)
    • "The exact phrase is here." (match in a sentence where the phrase appears with word boundaries)
    • "This is an exact phrase."
  • Does Not Match:
    • "exactphrases" (the words are not separated by boundaries)
    • "exact phraseology" (different words, even if they include part of the search phrase)

2. Partial Phrase Search:

  • Search Phrase: "exact"
  • Matches:
    • "exact phrase" (partial match of "exact" within the phrase)
    • "exactly what I wanted" (partial match of "exact")
    • "This is an exact match."
  • Does Not Match:
    • "no match here" (the search term "exact" is not found)

3. Punctuation Sensitivity:

When punctuation sensitivity is disabled:

  • Search Phrase: "exact. phrase"
  • Matches:
    • "exact, phrase"
    • "The exact. phrase!"

Is this what you want to match and not match or what currently happens?

What do you mean by this?

The wildcard field is not “sensitive to punctuation” by design. Like the “Ctrl F” behaviour you described - it searches everything.
If you want to allow for skipping punctuation, that would require a regex that lists the permitted characters between words, e.g. commas, carriage returns, etc. Several of these, such as full stops, would need escaping to avoid clashes with the reserved characters of regular expressions.
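
The idea of listing the permitted characters between words could look something like the sketch below. It is only an illustration under stated assumptions: the separator set in `SEPARATOR_CHARS` is an arbitrary choice, and `separator_class` and `punctuation_insensitive_pattern` are hypothetical helper names. Each separator is backslash-escaped individually, because in Lucene's syntax a backslash makes the following character literal; whitespace is included as real characters rather than as `\t` or `\n`, which Lucene would read as the letters "t" and "n".

```python
import re

# Assumed set of characters allowed between words: whitespace plus common
# punctuation. The tab, carriage return and newline are real characters.
SEPARATOR_CHARS = " \t\r\n.,;:!?-'\"()"


def separator_class(chars: str = SEPARATOR_CHARS) -> str:
    """Return a character class matching one or more of the separators."""
    return "[" + "".join("\\" + ch for ch in chars) + "]+"


def punctuation_insensitive_pattern(phrase: str) -> str:
    """Build a pattern that ignores which separators sit between the words."""
    # Keep only the word parts of the query; they contain no reserved characters.
    words = re.findall(r"\w+", phrase)
    if not words:
        raise ValueError("phrase contains no word characters")
    return ".*" + separator_class().join(words) + ".*"
```

For the search phrase "exact. phrase", this yields a pattern of the form `.*exact[...]+phrase.*`, so both "exact, phrase" and "The exact. phrase!" would be candidates for a match. The pattern would go into the `value` of a `regexp` query.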

For both exact and partial matching, I want the option to choose whether the search is sensitive to punctuation. That makes four separate queries:

  1. Exact sensitive to punc
  2. Exact not sensitive to punc
  3. Partial sensitive to punc
  4. Partial not sensitive to punc

Then I suspect you will need to alter how you rewrite the query string into a regexp based on the behaviour you desire?

That's right, I'm asking about that.

What I have tried:
```python
import re


def construct_regex_query(phrase, exact=True, punctuation_sensitive=True):
    # Escape the input string to safely include it in the query
    escaped_phrase = re.escape(phrase)

    if punctuation_sensitive:
        # Exact match with punctuation sensitivity (using word boundaries)
        if exact:
            # Use \b for word boundaries to match the exact phrase, allowing other words around it
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*\b{escaped_phrase}\b.*"  # Word boundary match
                        }
                    }
                }
            }
        else:
            # Partial match with punctuation sensitivity (allowing wildcards on both sides)
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase}.*"  # Partial match
                        }
                    }
                }
            }
    else:
        # Remove punctuation to make the search insensitive to punctuation
        phrase_without_punctuation = re.sub(r'[^\w\s]', '', phrase)
        escaped_phrase_no_punc = re.escape(phrase_without_punctuation)

        if exact:
            # Use word boundaries even after removing punctuation
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*\b{escaped_phrase_no_punc}\b.*"  # Word boundary match without punctuation
                        }
                    }
                }
            }
        else:
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase_no_punc}.*"  # Partial match without punctuation
                        }
                    }
                }
            }
```
The partial match appears to work as expected, but the exact match still doesn't function at all in this implementation.

I came across a StackOverflow answer (which I can't share due to link restrictions) that mentions \b can't be used in Elasticsearch, but the alternative suggestion also didn't work.

The regex syntax supported in Lucene (and therefore elasticsearch) is documented here:
https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/util/automaton/RegExp.html

Any unsupported expression, e.g. \b, will not throw an error. The backslash simply escapes the next character, so the query searches for that literal character, in this case the letter “b”.

Can you help with correct equivalent?

This wouldn’t be the forum for general regex advice, but this should give you some pointers: regex101 (build, test, and debug regex).