Elasticsearch Query for Exact Substring Matching with Spaces

Umang_Kamdar · June 11, 2024, 6:12am

I'm trying to perform an exact substring match in Elasticsearch, including substrings that contain spaces. Here’s what I need:

Search for an exact substring within a larger text field.
The substring may contain spaces.
The substring may be a partial word and not necessarily a full word.
I want to match the substring exactly as it appears, not just individual terms.

I've tried the following approaches without success:

Wildcard Query:

{
  "query": {
    "wildcard": {
      "description": {
        "value": "*substring with spaces*",
        "case_insensitive": true
      }
    }
  }
}

Query String with Analyze Wildcard:

{
  "query": {
    "query_string": {
      "query": "*substring with spaces*",
      "fields": ["description"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}

Neither approach returns the expected results when the substring contains spaces.

Example:

Field: description

Value: "quick brown fox"

The substring can be any of the following: "quick brown fox", "ick br", "quick bro", "quick", "brown f", etc., and I still want it to match exactly.

How can I construct an Elasticsearch query to achieve exact substring matching, including spaces and partial words?

dadoonet · June 11, 2024, 7:20am

Welcome!

You could try the following strategy: the N-gram tokenizer (see "N-gram tokenizer" in the Elasticsearch Guide [8.14]).

Umang_Kamdar · June 11, 2024, 7:54am

Thanks for the reply @dadoonet.
Would this still be helpful even if I don't know the length of my description? I mean, how can I decide on the min and max grams? The input can be just a letter (like 'q'), the whole sentence (like 'quick brown fox'), or some random substring (like 'own fo').

In all of these cases I always want exact matching, and every record that matches should be in the result set. I also don't want any scoring, just all the results that match the given substring.

dadoonet · June 11, 2024, 9:33am

Another solution is to use the wildcard field type, which might actually be easier.
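
A minimal sketch of what this suggestion could look like, written as Python dictionaries that mirror the JSON request bodies. The field name `description` comes from the question; the helpers `escape_wildcard` and `substring_query` are hypothetical names used only for illustration. It assumes a cluster whose version supports the wildcard field type and the `case_insensitive` option, and it wraps the query in a `constant_score` filter on the assumption that no relevance scoring is wanted.

```python
import json

# Mapping sketch: "description" indexed with the wildcard field type.
MAPPING = {
    "mappings": {
        "properties": {
            "description": {"type": "wildcard"}
        }
    }
}


def escape_wildcard(text: str) -> str:
    """Escape the characters that the wildcard query treats specially."""
    # Backslash must be handled first so later escapes are not doubled.
    for ch in ("\\", "*", "?"):
        text = text.replace(ch, "\\" + ch)
    return text


def substring_query(field: str, substring: str) -> dict:
    """Build a non-scoring 'contains substring' search body (hypothetical helper)."""
    return {
        "track_total_hits": True,
        "query": {
            "constant_score": {
                "filter": {
                    "wildcard": {
                        field: {
                            "value": "*" + escape_wildcard(substring) + "*",
                            "case_insensitive": True,
                        }
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
    print(json.dumps(substring_query("description", "ick br"), indent=2))
```

The printed JSON can be pasted into an index-creation request and a `_search` request respectively. Escaping matters only when the user's substring itself contains `*`, `?`, or a backslash.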

Umang_Kamdar · June 11, 2024, 6:44pm

It seems that wildcard fields are not yet supported in the latest version of OpenSearch. I thought OpenSearch would have the same field types as Elasticsearch, but that doesn't seem to be the case yet.

system · June 11, 2024, 6:44pm

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See "What is OpenSearch and the OpenSearch Dashboard?" on the Elastic website for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)

dadoonet · June 11, 2024, 8:34pm

Oh? You are using an outdated fork?

You should switch to Elasticsearch 8.14.0.
It's complete, mature, and secure.

And still free as it was before the fork, with even more features in the free tier.

Umang_Kamdar · June 12, 2024, 8:55am

@dadoonet I am trying to use the ngram tokenizer as you mentioned, with the following settings:

  "settings": {
    "index": {
      "max_ngram_diff": 14
    },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter"
          ]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },

Now what's the best way to look for my exact substring? Is it like this?

GET index-new/_search
{
  "track_total_hits": true,
  "query": {
    "match_phrase": {
      "description": "own f"
    }
  }
}

Or do you suggest any better way?
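
One thing worth checking with the settings above: `token_chars: ["letter"]` makes the tokenizer split on every non-letter character, so no indexed token can contain a space. The Python sketch below only approximates that analyzer (the function `ngram_tokens` is a hypothetical stand-in, not part of any Elasticsearch client, and Python's `isalpha` is assumed to be close enough to the tokenizer's notion of a letter). It shows that "own f" is analyzed as grams of "own" plus the gram "f" rather than as a single term.

```python
from typing import List


def ngram_tokens(text: str, min_gram: int = 1, max_gram: int = 15) -> List[str]:
    """Approximate the analyzer above: split on non-letters, emit n-grams, lowercase."""
    # token_chars: ["letter"] means any non-letter (including a space) ends a chunk.
    chunks, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))

    # For each start offset, emit grams of increasing length within the chunk.
    tokens = []
    for chunk in chunks:
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(chunk):
                    break
                tokens.append(chunk[start:start + size].lower())
    return tokens


if __name__ == "__main__":
    indexed = ngram_tokens("quick brown fox")
    print(any(" " in token for token in indexed))  # False
    print(ngram_tokens("own f"))
```

If substrings containing spaces must be matched through n-grams, the tokenizer would have to keep whitespace inside tokens, for example by adding `whitespace` to `token_chars`.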

Christian_Dahlqvist · June 12, 2024, 9:09am

As far as I can see, this is exactly the use case the wildcard field type was designed for. It uses ngrams behind the scenes to reduce the number of candidates it needs to check and then matches the full pattern.

To get the same effect using standard ngrams you will need to rewrite the query and then perform post-processing on the results to filter out false positives. This can be tricky, so I would recommend upgrading to the latest version of Elasticsearch instead.
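
The two-step idea described here (narrow the candidates with n-grams, then verify the full pattern to drop false positives) can be illustrated with a small in-memory Python sketch. This is not how Elasticsearch implements the wildcard field internally; the trigram size and the names `trigrams`, `build_index`, and `search` are assumptions chosen for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """All 3-character substrings of text, lowercased (spaces included)."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each trigram to the ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs: Dict[int, str], index: Dict[str, Set[int]], substring: str) -> List[int]:
    """Return ids of documents that contain substring, ignoring case."""
    needle = substring.lower()
    grams = trigrams(needle)
    if grams:
        # Candidate step: documents containing every trigram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Needle shorter than 3 characters: no trigram to use, check every document.
        candidates = set(docs)
    # Verification step: a real substring check removes false positives.
    return sorted(d for d in candidates if needle in docs[d].lower())


if __name__ == "__main__":
    docs = {1: "quick brown fox", 2: "brown quick fox", 3: "brown fig, tan fox"}
    index = build_index(docs)
    print(search(docs, index, "own fo"))  # [1]
```

In this example, document 3 contains every trigram of "own fo" but not the substring itself, so it survives the candidate step and is removed only by the verification step. That is the kind of false positive the post-processing has to catch.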
