Elasticsearch - Searching partial text in a string

What is the best way to use Elasticsearch to search for an exact piece of partial text (a substring) inside a string?

In SQL this would be a LIKE pattern such as %PARTIAL TEXT% or %ARTIAL TEX%.

In Elasticsearch, the query currently being used is:

{
    "query": {
        "match_phrase_prefix": {
             "name": "PARTIAL TEXT"
        }
    }
}

However, it stops matching as soon as the first and last characters of the search text are removed, as shown below (no results found):

{
    "query": {
        "match_phrase_prefix": {
             "name": "ARTIAL TEX"
        }
    }
}

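You are probably looking for a wildcard query. match_phrase_prefix treats only the last term as a prefix and every earlier term must match a whole token, so "ARTIAL" never matches the token "partial".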

Suppose an index like this, with one document:

PUT netflix_movie_title
{
    "mappings": {
        "properties": {
            "title": {
                "type": "keyword"
            }
        }
    }
}

POST netflix_movie_title/_doc
{
  "title": "The Hitchhiker's Guide to the Galaxy"
}

A wildcard query lets you run a query similar to SQL LIKE, where * matches zero or more characters and ? matches exactly one character.

Match start:

POST netflix_movie_title/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "The Hitchhiker's Guide to th*"
      }
    }
  }
}

Match end:

POST netflix_movie_title/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "*er's Guide to the Galaxy"
      }
    }
  }
}

Match middle:

POST netflix_movie_title/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "*er's Guide to the Gal*"
      }
    }
  }
}

No match:

POST netflix_movie_title/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "*er's Gui o the Gal*"
      }
    }
  }
}

However, the keyword field is designed for exact-match queries. If you need full-text search capabilities too, consider using a multi-field mapping (Field data types | Elasticsearch Guide [8.4] | Elastic).
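As a rough sketch of what such a multi-field mapping could look like: the index name netflix_movie_title_multi and the sub-field name keyword are placeholders chosen for this example, not something taken from your setup. The title field is analyzed as text for full-text search, while title.keyword keeps the whole string for wildcard queries.

```json
# Hypothetical index name; "title" is analyzed text, "title.keyword" is the exact string
PUT netflix_movie_title_multi
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Full-text search goes against the text field
POST netflix_movie_title_multi/_search
{
  "query": {
    "match": {
      "title": "hitchhiker galaxy"
    }
  }
}

# Partial-string search goes against the keyword sub-field
POST netflix_movie_title_multi/_search
{
  "query": {
    "wildcard": {
      "title.keyword": {
        "value": "*er's Guide to the Gal*"
      }
    }
  }
}
```

The same document is then reachable both ways, at the cost of indexing the value twice.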

Running wildcard queries on keyword fields has two problems:

  1. It won't work on large values
  2. The search cost is linear with the number of unique values

That’s why the wildcard field type was created, and this blog post gives the background. It has shortcomings too, because the search cost is linear with the number of docs that hold a value that roughly matches the search.
There’s always some kind of performance trade-off.
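For illustration, a minimal sketch of the same title field mapped with the wildcard field type, assuming Elasticsearch 7.9 or later. The index name netflix_movie_title_wildcard is made up for this example. The query syntax stays the same as on a keyword field; only the mapping changes.

```json
# Hypothetical index name; the field uses the dedicated wildcard type
PUT netflix_movie_title_wildcard
{
  "mappings": {
    "properties": {
      "title": {
        "type": "wildcard"
      }
    }
  }
}

POST netflix_movie_title_wildcard/_doc
{
  "title": "The Hitchhiker's Guide to the Galaxy"
}

# Same wildcard query as before, now against the wildcard field
POST netflix_movie_title_wildcard/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "*er's Guide to the Gal*"
      }
    }
  }
}
```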

Totally agree, performance should be considered for this use case. The blog post above has a guide to choosing a data type. If you are not sure which data type to use, a possible approach is to create a new sample index with the desired type, use the reindex API to copy part or all of the production index into it, and explore your data there.
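A sketch of that approach, under some assumptions: netflix_movie_title stands in for the production index, netflix_movie_title_sample is a made-up name for the sample index, and 1000 is an arbitrary sample size. The sample index is created first so that the reindexed documents pick up the mapping you want to try.

```json
# Hypothetical sample index with the field type under evaluation
PUT netflix_movie_title_sample
{
  "mappings": {
    "properties": {
      "title": {
        "type": "wildcard"
      }
    }
  }
}

# Copy at most 1000 documents; remove max_docs to copy the whole index
POST _reindex
{
  "max_docs": 1000,
  "source": {
    "index": "netflix_movie_title"
  },
  "dest": {
    "index": "netflix_movie_title_sample"
  }
}
```

You can then run your real queries against the sample index and compare response times before changing the production mapping.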
