Ngram behavior vs wildcard field type

Hi,

Please help me understand how n-gram analysis and the wildcard field type behave. I am working on an application that offers search by phone number. It will be a "contains" search, e.g. a search for phone numbers containing "234890".
Our Elasticsearch index has close to 1 billion documents.

While looking into options, I came across the wildcard field type, which seems to fit our use case. We haven't done any performance tests yet.

The wildcard field type uses 3-grams, so I wanted to compare a 3-gram field with a wildcard field, but I am having a hard time understanding why the example below does not match my expectations.

Here is the example I am using:

PUT ngram-index
{
  "settings": {
    "index": {
      "number_of_shards": "2",
      "number_of_replicas": "1"
    },
    "analysis": {
      "analyzer": {
        "ngram": {
          "tokenizer": "ngram"
        }
      },
      "tokenizer": {
        "ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "ngram_field": {
        "type": "text",
        "analyzer": "ngram"
      },
      "wildcard_field": {
        "type": "wildcard",
        "ignore_above": 25
      }
    }
  }
}


PUT /ngram-index/_doc/1
{
  "ngram_field": "1234567",
  "wildcard_field": "1234567"
}

PUT /ngram-index/_doc/2
{
  "ngram_field": "234890",
  "wildcard_field": "234890"
}

When I run a search against the wildcard field, I get the result I expect, which is document id=2:

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "wildcard_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

However, when I search against the 3-gram field, I get no results. I was expecting both documents id=1 and id=2 to be returned, because the 3-gram "234" exists in both documents.
Can you please help me understand the behavior of this query?

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "ngram_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

Thanks,
Moulay

Hi!

This happens because of how your ngram field is tokenized.
The tokens indexed for 234890 are 234, 348, 489 and 890.

Your pattern is 23489*. The wildcard query will not find a match because no indexed token starts with 23489.

If you test with 234* or 23*, it will retrieve documents.
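
To make this concrete, here is a small pure-Python simulation (not Elasticsearch code). It assumes the ngram tokenizer with min_gram = max_gram = 3 produces every 3-character substring, and that a term-level wildcard query must match a single indexed token in full. The helper names `ngrams` and `term_level_wildcard` are made up for illustration.

```python
from fnmatch import fnmatchcase


def ngrams(text, n=3):
    """Return the n-grams of text, mimicking min_gram = max_gram = n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def term_level_wildcard(pattern, docs):
    """Return ids of docs where at least one indexed token matches the whole pattern."""
    return [
        doc_id
        for doc_id, value in docs.items()
        if any(fnmatchcase(token, pattern) for token in ngrams(value))
    ]


# The two documents from the original post.
docs = {1: "1234567", 2: "234890"}

print(ngrams("234890"))                     # ['234', '348', '489', '890']
print(term_level_wildcard("23489*", docs))  # [] because no 3-character token starts with 23489
print(term_level_wildcard("234*", docs))    # [1, 2] because both docs contain the token 234
```

No indexed token is longer than three characters, so a pattern with a five-character literal prefix can never match a token.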

I thought that the ngram analyzer, at both search and index time, would tokenize 234890 into 234, 348, 489, 890, hence I was expecting results. Is that not the case? If not, how can I set up my search_analyzer to behave like the index analyzer?

One more thing I am trying to understand is why the wildcard field type returns the correct results when searching for 23489*. Shouldn't it behave like the ngram analyzer, since it is also based on 3-gram tokens?

The wildcard field has a ton of code in it to make wildcard and regex queries work as expected.
It first executes an approximation phase using the ngram index to accelerate queries (but only where appropriate) and then feeds candidate matches into a second validation phase that checks that the wildcard/regex matches the original full doc value. Great care is taken to ensure the approximation phase uses an ngram query that eliminates as many documents as possible without causing false negatives.

Just using an ngram index is not a substitute for this logic.
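
The two phases described above can be sketched roughly as follows. This is a simplified pure-Python illustration under stated assumptions, not the actual wildcard field implementation: it only handles `*` and `?`, takes the 3-grams of each literal run in the pattern as the approximation, and uses `fnmatchcase` on the original value as the verification step. The function names and document 3 are hypothetical, added only to show a false candidate.

```python
import re
from fnmatch import fnmatchcase


def ngrams(text, n=3):
    """Set of n-grams of text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def required_grams(pattern, n=3):
    """Grams every matching value must contain: n-grams of each literal run."""
    grams = set()
    for literal_run in re.split(r"[*?]", pattern):
        grams |= ngrams(literal_run, n)
    return grams


def two_phase_search(pattern, docs):
    needed = required_grams(pattern)
    # Phase 1 (approximation): keep docs whose grams include all required grams.
    candidates = [d for d, value in docs.items() if needed <= ngrams(value)]
    # Phase 2 (verification): check the pattern against the original full value.
    matches = [d for d in candidates if fnmatchcase(docs[d], pattern)]
    return candidates, matches


# Docs 1 and 2 are from the original post; doc 3 is a hypothetical value that
# contains the grams 234, 348 and 489 but not the substring 23489.
docs = {1: "1234567", 2: "234890", 3: "3489234"}

candidates, matches = two_phase_search("23489*", docs)
print(candidates)  # [2, 3]
print(matches)     # [2]
```

Document 3 passes the gram check but fails verification, which is the kind of false positive a plain ngram index cannot rule out on its own.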

Hi there,
As I understand it, you need a query that finds phone numbers containing a given value, right?

If so, you can search using query_string, e.g.:

GET /index/_search
{
  "query": {
    "query_string": {
      "default_field": "mobile",
      "query": "*05793*"
    }
  }
}

I hope this helps.

Can you help me understand why the ngram analyzer is not tokenizing the search value at query time?
I was expecting my query to search for 234, 348, 489, 89*.

POST /ngram-index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "wildcard": {
            "ngram_field": {
              "value": "23489*"
            }
          }
        }
      ]
    }
  }
}

Wildcard queries are part of the term-level family of queries whose docs state:

“Unlike full-text queries, term-level queries do not analyze search terms. Instead, term-level queries match the exact terms stored in a field”
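
For contrast, here is a rough pure-Python picture of what happens when the search text is analyzed, as a full-text query such as match would do with the field's analyzer. It is a simulation only, and it assumes every gram of the query is required (an AND-style match). The names `analyze` and `analyzed_search` are made up for illustration.

```python
def analyze(text, n=3):
    """Mimic an ngram analyzer with min_gram = max_gram = n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def analyzed_search(query_text, docs):
    """Analyze the query text, then require every query gram in the doc's grams."""
    query_grams = analyze(query_text)
    return [
        doc_id
        for doc_id, value in docs.items()
        if query_grams and all(g in analyze(value) for g in query_grams)
    ]


# The two documents from the original post.
docs = {1: "1234567", 2: "234890"}

print(analyze("23489"))                # ['234', '348', '489']
print(analyzed_search("23489", docs))  # [2]
```

Note that the analyzer only emits whole 3-grams, so there is no trailing 89* term, and the search text is passed without a wildcard character.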

Thanks Mark.
