We are aiming to implement a naive search in Elasticsearch that functions exactly like a Ctrl+F search. Specifically, a query like "rd1 wo" should match "word1 word2" because it is a partial match in sequence.
The search must return only documents containing the exact matching sequence and exclude everything else. While this behavior seems straightforward, achieving it in Elasticsearch has proven unexpectedly complex because of how it analyzes and tokenizes text.
Despite the temptation to use simpler methods, the requirement is to implement this directly in Elasticsearch.
Key Requirements:
Support partial matches (sub-word queries).
Handle both single-word queries and multi-word phrases.
Ensure sub-word queries work for sequences of two or more partial words.
Exclude all non-matching results—no false positives allowed.
Currently, we're using a combination of match_phrase and an n-gram filter, followed by post-processing to filter out non-matching results. However, this approach is inefficient and not ideal.
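For reference, this is roughly what our current setup looks like; the index body and query below are reconstructed sketches (field and analyzer names are placeholders, and the real gram sizes may differ):

```python
# Rough sketch of the current approach: an n-gram filter at index time,
# queried with match_phrase, then post-filtered in application code.
# Index/field/analyzer names are placeholders.
ngram_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "partial_ngrams": {"type": "ngram", "min_gram": 2, "max_gram": 3}
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "partial_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "ngram_analyzer",
                "search_analyzer": "standard",
            }
        }
    },
}

# The query itself; results still need post-filtering in application code,
# because n-grams from different words can line up and produce false positives.
ngram_query = {
    "query": {
        "match_phrase": {
            "text": "rd1 wo"
        }
    }
}
```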
Any guidance or solutions to achieve this in Elasticsearch would be greatly appreciated.
What exactly did you try with the field type I pointed to, and what about it did not work? I think it should meet most, if not all, of your stated requirements.
Please show mappings, sample document(s) and queries so we can recreate and get a better understanding of what is and is not working.
Awesome! I just tried it, and it works perfectly!
I hadn’t realized there was such a wildcard type for fields—I thought it was only available as a query type.
Thanks so much!
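For anyone who finds this thread later, here is a minimal sketch of the wildcard setup we ended up with (index and field names are placeholders):

```python
# Minimal sketch: map the field as the wildcard type, then query it with a
# wildcard query. Field name "text" is a placeholder.
wildcard_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "wildcard"}
        }
    }
}

# "rd1 wo" matches "word1 word2": the wildcard query runs a Ctrl+F-style
# substring match over the original field value.
wildcard_query = {
    "query": {
        "wildcard": {
            "text": {
                "value": "*rd1 wo*",
                "case_insensitive": True
            }
        }
    }
}
```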
I need further help with constructing the correct regular expression for a query that meets the following requirements:
a. Exact phrase with word boundaries:
The query should match the exact phrase, ensuring word boundaries are respected.
b. Partial phrase:
The query should also allow searching for partial phrases.
Both types of search should offer an option to control whether or not they are sensitive to punctuation.
The wildcard field is not “sensitive to punctuation” by design; like the Ctrl+F behaviour you described, it searches everything.
If you want to allow for skipping punctuation, that would require a regex that lists the permitted characters between words, e.g. commas, carriage returns, etc. Several of these would need escaping so they do not clash with characters that are reserved in regular expressions, such as the full stop.
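Something along these lines, as a sketch; the set of permitted separator characters below is only an example, and the helper name is made up:

```python
import re

# Sketch of the "list the permitted characters between words" idea.
# The separator class (space, comma, semicolon, escaped full stop, carriage
# return, newline) is only an example; extend or trim it as needed.
SEPARATOR_CLASS = "[ ,;\\.\r\n]*"

def punctuation_tolerant_pattern(phrase):
    # Escape each word so regex metacharacters are treated literally, then
    # allow any run of separator characters between consecutive words.
    return SEPARATOR_CLASS.join(re.escape(word) for word in phrase.split())

# Example: punctuation_tolerant_pattern("word1 word2") can be dropped into the
# "value" of a regexp query, wrapped in .* on both sides for a partial match.
```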
What I have tried:
```python
import re

def construct_regex_query(phrase, exact=True, punctuation_sensitive=True):
    # Escape the input string so regex metacharacters in it are treated literally
    escaped_phrase = re.escape(phrase)

    if punctuation_sensitive:
        if exact:
            # Exact match with punctuation sensitivity: \b word boundaries around
            # the phrase, allowing other text on either side. Raw f-strings are
            # needed so \b reaches Elasticsearch literally rather than becoming a
            # Python backspace character.
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": rf".*\b{escaped_phrase}\b.*"  # word-boundary match
                        }
                    }
                }
            }
        else:
            # Partial match with punctuation sensitivity
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase}.*"  # substring match
                        }
                    }
                }
            }
    else:
        # Strip punctuation from the phrase to make the search punctuation-insensitive
        phrase_without_punctuation = re.sub(r'[^\w\s]', '', phrase)
        escaped_phrase_no_punc = re.escape(phrase_without_punctuation)

        if exact:
            # Word boundaries around the punctuation-stripped phrase
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": rf".*\b{escaped_phrase_no_punc}\b.*"  # word-boundary match, no punctuation
                        }
                    }
                }
            }
        else:
            return {
                "query": {
                    "regexp": {
                        "text": {
                            "value": f".*{escaped_phrase_no_punc}.*"  # substring match, no punctuation
                        }
                    }
                }
            }
```
The partial match appears to work as expected, but the exact match still doesn't function at all in this implementation.
I came across a StackOverflow answer (which I can't share due to link restrictions) that mentions \b can't be used in Elasticsearch, but the alternative suggestion also didn't work.
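For reference, the alternative it suggested was, as far as I can reconstruct it, to spell out the word boundary with a negated character class, roughly like the sketch below (word characters assumed to be letters, digits and underscore; `exact_phrase_query` is just an illustrative helper, and the field name "text" is a placeholder):

```python
import re

# The Lucene regexp engine behind Elasticsearch's regexp query has no \b
# operator, so the boundary has to be spelled out with character classes.
NON_WORD = "[^a-zA-Z0-9_]"

def exact_phrase_query(phrase):
    escaped = re.escape(phrase)
    # The phrase must either start the field value or be preceded by a
    # non-word character, and either end the value or be followed by one.
    pattern = f"(.*{NON_WORD})?{escaped}({NON_WORD}.*)?"
    return {"query": {"regexp": {"text": {"value": pattern}}}}
```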