I cannot search by telephone (part)

mg85 · May 13, 2023, 4:23pm

I'm trying to build an autocomplete. This is my index creation:

curl -X PUT "localhost:9200/backoffice_clients-com" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom", 
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": { "type": "keyword" },
        "backofficeclinic_id": { "type": "keyword" },
        "fullname": { "type": "text" },
        "email": { "type": "text" },
        "idCard": { "type": "text" },
        "telephone": { "type": "text" }
      }
    }
  }
}'

And this is my search method:

public function search(string $backofficeclinic_id, string $query): array
{
    $params = [
        'index' => 'backoffice_clients-com',
        'body'  => [
            'query' => [
                'bool' => [
                    'must' => [
                        ['term' => ['backofficeclinic_id' => $backofficeclinic_id]],
                        [
                            'multi_match' => [
                                'query' => $query,
                                'type' => 'phrase_prefix',
                                'fields' => ['fullname', 'email', 'idCard', 'telephone']
                            ]
                        ]
                    ]
                ]
            ],
            'sort' => [
                ['_score' => ['order' => 'desc']],
            ],
            'explain' => true,  // Add explanation of the scoring
        ]
    ];

    $response = $this->ESclient->search($params);

    foreach ($response['hits']['hits'] as $hit) {
        echo 'Document ID: ', $hit['_id'], PHP_EOL;
        echo 'Score: ', $hit['_score'], PHP_EOL;
        echo 'Explanation: ', json_encode($hit['_explanation']), PHP_EOL;
        echo 'Source: ', json_encode($hit['_source']), PHP_EOL;
    }

    return array_map(function ($hit) {
        return $hit['_source'];
    }, $response['hits']['hits']);
}

I'm testing with these 2 documents:

array:2 [
  0 => array:6 [
    "id" => "be7dcc65-1876-4bc1-9a01-dc9183d79e1d"
    "backofficeclinic_id" => "dcf02d56-5240-43fb-9ba0-1ac3eafafa79"
    "fullname" => "Mikel Good"
    "email" => "mikel@gmail.com"
    "idCard" => null
    "telephone" => "+34661422181"
  ]
  1 => array:6 [
    "id" => "816c7cc6-965c-446b-b20f-8f1218757d73"
    "backofficeclinic_id" => "dcf02d56-5240-43fb-9ba0-1ac3eafafa79"
    "fullname" => "Mikel Bad"
    "email" => "mikel@gmail.com"
    "idCard" => "7777"
    "telephone" => "+34661422182"
  ]
]

If I use these keywords:

"mikel" returns 2 documents
"mikel g" returns 1 document (be7dcc65-1876-4bc1-9a01-dc9183d79e1d)
"mikel b" returns 1 document (816c7cc6-965c-446b-b20f-8f1218757d73)

The problem occurs when I try to search by telephone: if I use "661" as the keyword, 0 documents are returned, but it should return 2. I don't know what needs to be changed to fix my code.

Thank you very much for your time.

Priscilla_Parodi · June 7, 2023, 7:21pm

Hello @mg85,

You are using the standard tokenizer, which splits the text into terms on word boundaries and removes most punctuation symbols. When you perform a full-text search, the terms in the query string are then looked up individually.

Let's consider the telephone field.

As you can see here:

POST _analyze
{
  "analyzer": "standard",
  "text":     "+34661422182"
}

{
  "tokens": [
    {
      "token": "34661422182",
      "start_offset": 1,
      "end_offset": 12,
      "type": "<NUM>",
      "position": 0
    }
  ]
}

Only one token was created:

"token": "34661422182".

For the full name "Mikel Good":

POST _analyze
{
  "analyzer": "standard",
  "text":     "Mikel Good"
}

{
  "tokens": [
    {
      "token": "mikel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "good",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

We have two tokens:

"token": "mikel"
"token": "good"

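In this case, a match query for "Mikel" finds the document, but "mikelg" or "mik" would not, because a match query compares whole tokens. (Your phrase_prefix query is more lenient: it treats the last term as a prefix, which is why "mikel g" works.)

Your "telephone" field is similar: its only token is "34661422182". A prefix can only match from the start of a token, and "661" sits in the middle of it, so nothing is found.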

It will work for:

GET /backoffice_clients-com/_search
{
  "query": {
    "match": {
      "telephone": "34661422182"
    }
  }
}

OR

GET /backoffice_clients-com/_search
{
  "query": {
    "match": {
      "telephone": "+34661422182"
    }
  }
}

But not for "telephone": "3466142218", which is missing the last digit and therefore is not the same token.
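To make this behaviour concrete, here is a rough Python sketch that runs without a cluster. It is not Elasticsearch code: `analyze` is a hypothetical stand-in for the standard tokenizer plus the lowercase filter and only handles simple inputs like the ones in this thread, `match` mimics a match query on a single field, and `phrase_prefix` is a simplified model of the phrase_prefix type from the original search method.

```python
import re

def analyze(text):
    """Rough stand-in for standard tokenizer + lowercase (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match(field_value, query):
    """Mimic a match query: some analyzed query term equals a whole indexed token."""
    indexed = set(analyze(field_value))
    return any(term in indexed for term in analyze(query))

def phrase_prefix(field_value, query):
    """Simplified phrase_prefix: all terms but the last must match consecutive
    tokens exactly; the last term only needs to be a prefix of the next token."""
    tokens = analyze(field_value)
    terms = analyze(query)
    if not terms:
        return False
    n = len(terms)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if window[:-1] == terms[:-1] and window[-1].startswith(terms[-1]):
            return True
    return False

print(match("+34661422182", "+34661422182"))    # True: "+" is dropped on both sides
print(match("+34661422182", "3466142218"))      # False: not the whole token
print(phrase_prefix("Mikel Good", "mikel g"))   # True: "g" is a prefix of "good"
print(phrase_prefix("+34661422181", "3466"))    # True: starts the token
print(phrase_prefix("+34661422181", "661"))     # False: "661" is mid-token
```

The last line is the situation from the question: the query term is compared against the start of the token "34661422181", so digits from the middle of the number never match.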

To solve this, I can think of two options:

You can consider using an Edge n-gram token filter to produce more tokens by considering a specified length from the beginning of a token.

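For example, "3466142218" can produce "3", "34", "346", "3466" and so on. Then, you can use a simple match query to search the telephone field against these tokens. Note that edge n-grams always start at the beginning of the token, so this covers queries like "3466" but not "661"; matching digits in the middle of the number needs a plain n-gram token filter instead.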
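As an illustration of what such a filter would put in the index, here is a small Python sketch. The function names and the min/max gram values are assumptions chosen for this example, not Elasticsearch defaults, and the code only imitates the idea of the edge n-gram and n-gram token filters.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of the token, similar to what an edge n-gram filter emits."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def ngrams(token, min_gram=3, max_gram=3):
    """Every substring of the given lengths, similar to a plain n-gram filter."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

token = "34661422181"  # what the standard tokenizer leaves of "+34661422181"

print(edge_ngrams(token)[:4])        # ['3', '34', '346', '3466']
print("661" in edge_ngrams(token))   # False: "661" is not a prefix
print("661" in ngrams(token))        # True: found at offset 2
```

The trade-off is index size: plain n-grams produce many more tokens than edge n-grams. If you go this way, it is also common to set a separate search_analyzer on the field so that the query text itself is not split into n-grams.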

Alternatively, you can consider using a Wildcard query, which would be something like:

GET /backoffice_clients-com/_search
{
  "query": {
    "wildcard": {
      "telephone": "*661*"
    }
  }
}

The * matches any sequence of characters (including none), so this pattern allows other characters before and after "661".
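The pattern semantics can be tried out locally with Python's `fnmatch` module, which treats `*` and `?` the same way. This is only an approximation for illustration: `fnmatch` also understands `[...]` character sets, which the wildcard query does not, and the `wildcard` helper and token list below are made up for this sketch.

```python
from fnmatch import fnmatchcase

# Tokens the standard analyzer would index for the two example documents.
indexed_tokens = ["34661422181", "34661422182"]

def wildcard(tokens, pattern):
    """Return the tokens that match the pattern as a whole (hypothetical helper)."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

print(wildcard(indexed_tokens, "*661*"))  # ['34661422181', '34661422182']
print(wildcard(indexed_tokens, "*2182"))  # ['34661422182']
print(wildcard(indexed_tokens, "661*"))   # []: the token does not start with 661
```

Keep in mind that patterns beginning with * or ? can be slow on large indices, because many terms have to be checked.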

Hope it helps!
