Ngram behavior vs wildcard field type

maratusa · April 20, 2022, 2:49am

Hi,

Please help me understand the behavior of n-gram analysis and the wildcard field type. I am working on an application that offers search by phone number. It will be a "contains" search, e.g. a search for phone numbers containing "234890".
Our Elasticsearch index has close to 1 billion documents.

While looking into options, I came across the wildcard field type, which seems to fit our use case. We haven't done any performance tests yet.

The wildcard field type uses 3-grams, so I wanted to compare a 3-gram analyzer with the wildcard field, but I am having a hard time understanding why my example below does not match my expectations.

Here is the example I am using:

PUT ngram-index
{
  "settings": {
    "index": {
      "number_of_shards": "2",
      "number_of_replicas": "1"
    },
    "analysis": {
      "analyzer": {
        "ngram": {
          "tokenizer": "ngram"
        }
      },
      "tokenizer": {
        "ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "ngram_field": {
        "type": "text",
        "analyzer": "ngram"
      },
      "wildcard_field": {
        "type": "wildcard",
        "ignore_above": 25
      }
    }
  }
}


PUT /ngram-index/_doc/1
{
  "ngram_field": "1234567",
  "wildcard_field": "1234567"
}

PUT /ngram-index/_doc/2
{
  "ngram_field": "234890",
  "wildcard_field": "234890"
}

When I run a search against the wildcard field, I get the result I expect, which is document id=2:

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "wildcard_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

However, when I run the same search against the 3-gram field, I get no results. I was expecting both documents id=1 and id=2 to be returned because the 3-gram "234" exists in both documents.
Can you please help me understand this behavior with the query below?

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "ngram_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

Thanks,
Moulay

RabBit_BR · April 20, 2022, 8:53pm

Hi!

This happens because of your ngram tokenizer.
The tokens for 234890 are 234, 348, 489, and 890.

Your pattern is 23489*. The wildcard query finds no match because no indexed token starts with 23489; every token is exactly three characters long.

If you test with 234* or 23*, it will retrieve documents.
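
To make the token-level matching concrete, here is a minimal Python sketch. It is not Elasticsearch code: `ngrams` and `wildcard_hits` are hypothetical helpers, and `fnmatchcase` stands in for the wildcard query. It only illustrates that the pattern is compared against each stored 3-gram, not against the original value.

```python
from fnmatch import fnmatchcase

def ngrams(text, n=3):
    """Split text into overlapping n-grams, like min_gram = max_gram = n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def wildcard_hits(tokens, pattern):
    """Return the indexed tokens that the wildcard pattern matches."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

tokens = ngrams("234890")
print(tokens)                           # ['234', '348', '489', '890']
print(wildcard_hits(tokens, "23489*"))  # [] : no token starts with 23489
print(wildcard_hits(tokens, "234*"))    # ['234']
print(wildcard_hits(tokens, "23*"))     # ['234']
```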

maratusa · April 21, 2022, 12:41am

I thought that the ngram analyzer, at both search and index time, would tokenize 234890 into 234, 348, 489, 890, so I was expecting results. If that's not the case, how can I set up my search_analyzer to behave like the index analyzer?

One more thing I am trying to understand: why does the wildcard field type return the correct results when searching for 23489*? Shouldn't it behave like the ngram analyzer, since it is also based on 3-gram tokens?

Mark_Harwood · April 21, 2022, 7:14am

The wildcard field has a ton of code in it to make wildcard and regex queries work as expected.
It first executes an approximation phase using the ngram index to accelerate queries (but only where appropriate) and then feeds candidate matches into a second validation phase that checks that the wildcard/regex matches the original full doc value. Great care is taken to ensure the approximation phase uses an ngram query that eliminates as many documents as possible without causing false negatives.

Just using an ngram index is not a substitute for this logic.
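
The two-phase idea can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the actual wildcard field implementation: `required_grams`, `candidates`, and `search` are hypothetical names, only `*` and `?` are treated as wildcards, and document 3 is an extra value added here to show a false positive from the approximation phase.

```python
from fnmatch import fnmatchcase

def ngrams(text, n=3):
    """Set of overlapping n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def required_grams(pattern, n=3):
    """N-grams every match must contain, taken from the literal runs
    between wildcards. Runs shorter than n contribute nothing, which
    weakens the filter but never causes a false negative."""
    grams = set()
    for literal in pattern.replace("?", "*").split("*"):
        grams |= ngrams(literal, n)
    return grams

def candidates(docs, pattern):
    """Phase 1 (approximation): keep docs containing all required n-grams."""
    needed = required_grams(pattern)
    return [d for d, value in docs.items() if needed <= ngrams(value)]

def search(docs, pattern):
    """Phase 2 (verification): check the pattern against the full value."""
    return [d for d in candidates(docs, pattern) if fnmatchcase(docs[d], pattern)]

docs = {1: "1234567", 2: "234890", 3: "4892348"}
print(candidates(docs, "23489*"))  # [2, 3] : doc 3 has the grams, wrong order
print(search(docs, "23489*"))      # [2]
print(search(docs, "*234*"))       # [1, 2, 3]
```

Document 3 shows why the n-gram lookup alone is not enough: it contains 234, 348, and 489, but only the verification step notices that the full value does not start with 23489.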

anime_lover · April 21, 2022, 11:10am

Hi there,
As I understand it, you need a query that finds phone numbers containing a given value, right?

If so, you can search using query_string,
e.g.:

GET /index/_search
{
  "query": {
    "query_string": {
      "default_field": "mobile",
      "query": "*05793*"
    }
  }
}

I hope this helps.

maratusa · April 23, 2022, 12:44am

Can you help me understand why the ngram analyzer is not tokenizing the search value at query time?
I was expecting my query to search for 234, 348, 489, 89*.

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "ngram_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

Mark_Harwood · April 23, 2022, 4:03pm

Wildcard queries are part of the term-level family of queries whose docs state:

“Unlike full-text queries, term-level queries do not analyze search terms. Instead, term-level queries match the exact terms stored in a field”
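
The difference can be illustrated with a small Python sketch. It is a toy model, not Elasticsearch: `term_level` and `full_text` are hypothetical helpers, and positions and scoring are ignored. The term-level lookup uses the raw search string as-is, while the full-text lookup first runs it through the same 3-gram analysis, which is what a full-text query such as `match` does (with `or` as its default operator).

```python
def ngrams(text, n=3):
    """Overlapping n-grams, standing in for the ngram analyzer."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy inverted index: document id -> set of indexed 3-gram tokens.
index = {1: set(ngrams("1234567")), 2: set(ngrams("234890"))}

def term_level(index, term):
    """Not analyzed: the raw string must equal an indexed token."""
    return [d for d, tokens in index.items() if term in tokens]

def full_text(index, text, operator="or"):
    """Analyzed: the search text is tokenized before lookup."""
    query_tokens = ngrams(text)
    combine = all if operator == "and" else any
    return [d for d, tokens in index.items()
            if combine(t in tokens for t in query_tokens)]

print(term_level(index, "23489"))                 # [] : no such token
print(full_text(index, "23489"))                  # [1, 2] : both contain 234
print(full_text(index, "23489", operator="and"))  # [2]
```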

maratusa · April 27, 2022, 4:45pm

Thanks Mark.

system · May 25, 2022, 4:46pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
