Elasticsearch Query for Exact Substring Matching with Spaces

I'm trying to perform an exact substring match in Elasticsearch, including substrings that contain spaces. Here’s what I need:

  1. Search for an exact substring within a larger text field.
  2. The substring may contain spaces.
  3. The substring may be a partial word and not necessarily a full word.
  4. I want to match the substring exactly as it appears, not just individual terms.

I've tried the following approaches without success:

Wildcard Query:

{
  "query": {
    "wildcard": {
      "description": {
        "value": "*substring with spaces*",
        "case_insensitive": true
      }
    }
  }
}

Query String with Analyze Wildcard:

{
  "query": {
    "query_string": {
      "query": "*substring with spaces*",
      "fields": ["description"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}

Neither approach returns the expected results when the substring contains spaces.
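
A likely explanation, assuming `description` is mapped as a `text` field with the default standard analyzer (the mapping is not shown in this post): the wildcard query is a term-level query, so the pattern is compared against each indexed term ("quick", "brown", "fox") separately, and no single term contains a space. The Python sketch below only imitates that behaviour. `rough_terms` is a hypothetical stand-in for the real analyzer, and `fnmatchcase` stands in for wildcard matching.

```python
from fnmatch import fnmatchcase


def rough_terms(text):
    # Rough stand-in for the standard analyzer: lowercase, split on whitespace.
    return text.lower().split()


def any_term_matches(pattern, text):
    # A term-level wildcard query is tested against each indexed term separately.
    return any(fnmatchcase(term, pattern) for term in rough_terms(text))


value = "quick brown fox"
print(any_term_matches("*ick*", value))     # True: "quick" contains "ick"
print(any_term_matches("*ick br*", value))  # False: no single term contains a space
print(fnmatchcase(value, "*ick br*"))       # True: the unanalyzed string does match
```

If this is the cause, the pattern has to be matched against the unanalyzed value (for example a `keyword` or `wildcard` field) rather than against the analyzed terms.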

Example:

Field: description

Value: "quick brown fox"

The substring can be any of the following: "quick brown fox", "ick br", "quick bro", "quick", "brown f", etc., and in every case I still want it to match exactly.

How can I construct an Elasticsearch query to achieve exact substring matching, including spaces and partial words?

Welcome!

You could try the following strategy: the N-gram tokenizer (see "N-gram tokenizer" in the Elasticsearch Guide [8.14]).

Thanks for the reply @dadoonet.
Would this still be helpful even if I don't know how long my description may be? I mean, how can I decide on min_gram and max_gram? The input can be just a letter (like 'q'), the whole sentence (like 'quick brown fox'), or just some random substring (like 'own fo').

In all of these cases I always want exact matching, and every record that matches should be in the result set. I also don't want any scoring, just all the results that match the given substring.
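
To make the min_gram/max_gram trade-off concrete, here is a small Python sketch. It assumes the whole field value is treated as a single token (no `token_chars` splitting), and the function names are made up for illustration. It counts how many n-grams one value produces and checks whether a search string could exist as a single indexed n-gram at all.

```python
def ngram_count(text_length, min_gram, max_gram):
    # Number of n-grams emitted for one token of `text_length` characters.
    longest = min(max_gram, text_length)
    return sum(text_length - n + 1 for n in range(min_gram, longest + 1))


def fits_in_one_gram(substring, min_gram, max_gram):
    # Only substrings within [min_gram, max_gram] exist as a single indexed token.
    return min_gram <= len(substring) <= max_gram


print(ngram_count(len("quick brown fox"), 1, 15))  # 120
print(ngram_count(200, 1, 15))                     # 2895
print(fits_in_one_gram("quick brown fox", 1, 3))   # False
```

A wide min/max range covers every possible input with a single token, but the number of indexed tokens grows quickly with the length of the description. A narrow range keeps the index small, but longer search strings then have to be broken into several grams at query time.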

Another solution is to use the wildcard field type, which might actually be easier.
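
A minimal sketch of what that could look like, written as Python helpers that build the request bodies. The helper names are hypothetical, and the field name `description` is taken from the question. The backslash escaping of `*`, `?` and `\` follows Lucene-style wildcard syntax and should be treated as an assumption to verify against your version. The query sits in `bool.filter`, which runs in filter context, so no relevance score is computed.

```python
import json


def wildcard_mapping(field="description"):
    # Index mapping that stores the field with the `wildcard` field type.
    return {"mappings": {"properties": {field: {"type": "wildcard"}}}}


def escape_wildcards(text):
    # Assumption: a backslash makes `*`, `?` and `\` match literally.
    return "".join("\\" + ch if ch in "*?\\" else ch for ch in text)


def substring_search(substring, field="description"):
    # Wrap the user input in `*...*` so it may occur anywhere in the value.
    pattern = "*" + escape_wildcards(substring) + "*"
    clause = {"wildcard": {field: {"value": pattern, "case_insensitive": True}}}
    return {"track_total_hits": True, "query": {"bool": {"filter": [clause]}}}


print(json.dumps(wildcard_mapping(), indent=2))
print(json.dumps(substring_search("own f"), indent=2))
```

The first body would be sent when creating the index and the second to `_search`. Because the wildcard field keeps the original value, spaces inside the pattern are matched as ordinary characters.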

It seems wildcard fields are not yet supported in the latest version of OpenSearch. I thought OpenSearch would have the same field types as Elasticsearch, but apparently it doesn't yet.

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See What is OpenSearch and the OpenSearch Dashboard? | Elastic for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

Oh? You are using an outdated fork?

You should switch to Elasticsearch 8.14.0.
It's complete, mature and secured.

And still free as it was before the fork, with even more features in the free tier.

@dadoonet I am trying to use the ngram tokenizer as you mentioned, with the following settings:

"settings": {
    "index": {
      "max_ngram_diff": 14
    },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter"
          ]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
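
One thing worth checking in these settings: with `token_chars` set to `["letter"]`, the tokenizer splits the text on every non-letter character before building n-grams, so no indexed gram can contain a space. Leaving `token_chars` out keeps all characters. The Python sketch below is a rough simulation of that difference, not the real tokenizer, and its function names are hypothetical.

```python
import re


def ngrams(token, min_gram, max_gram):
    # All substrings of `token` whose length lies between min_gram and max_gram.
    return {
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    }


def ngram_tokens(text, min_gram=1, max_gram=15, letters_only=True):
    # letters_only=True imitates token_chars ["letter"]: split on non-letters first.
    text = text.lower()
    pieces = re.findall(r"[^\W\d_]+", text) if letters_only else [text]
    tokens = set()
    for piece in pieces:
        tokens |= ngrams(piece, min_gram, max_gram)
    return tokens


value = "quick brown fox"
print("own f" in ngram_tokens(value))                      # False: grams never span the space
print("own f" in ngram_tokens(value, letters_only=False))  # True: the space is kept
```

Under that reading, a search for "own f" cannot be answered by a single indexed gram from the settings above, whatever query type is used.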

Now what's the best way to look for my exact substring? Is it something like this:

GET index-new/_search
{
  "track_total_hits": true, 
  "query": {
   "match_phrase": {
     "description": "own f"
   }
  }
}

Or do you suggest any better way?

As far as I can see, this is exactly the use case the wildcard field type was designed for. It uses ngrams behind the scenes to reduce the number of candidates it needs to check and then matches the full pattern.

To get the same effect using standard ngrams you will need to rewrite the query and then perform post-processing on the results to filter out false positives. This can be tricky, so I would recommend upgrading to the latest version of Elasticsearch instead.
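
A rough Python sketch of that two-step approach, under several assumptions: a hypothetical `description.ngram` subfield indexed with a lowercasing n-gram analyzer that keeps spaces (`min_gram` of 1, no `token_chars`), and hits shaped like a normal search response with `_source`. The first helper rewrites the substring into term clauses on its grams. The second removes hits that contain all the grams but not the substring itself.

```python
def candidate_query(substring, max_gram=15, field="description.ngram"):
    # Cut the lowercased substring into windows no longer than max_gram and
    # require every window to be present as an indexed n-gram.
    needle = substring.lower()
    if not needle:
        raise ValueError("substring must not be empty")
    size = min(len(needle), max_gram)
    windows = sorted({needle[i:i + size] for i in range(len(needle) - size + 1)})
    return {"query": {"bool": {"filter": [{"term": {field: w}} for w in windows]}}}


def drop_false_positives(hits, substring, source_field="description"):
    # Keep only hits whose stored text really contains the substring.
    needle = substring.lower()
    return [h for h in hits if needle in h["_source"][source_field].lower()]


# With max_gram=3, "abc xbcd" contains the grams "abc" and "bcd" but not "abcd".
hits = [{"_source": {"description": "abcd"}},
        {"_source": {"description": "abc xbcd"}}]
print(candidate_query("abcd", max_gram=3))
print(drop_false_positives(hits, "abcd"))
```

The post-processing step is only needed when the substring is longer than `max_gram`. Shorter substrings map to a single gram and cannot produce false positives.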
