Querying for a string fragment spanning 2 words

I have a document with a name field with the following made-up address value:

0615 Zaxsnzfugwbb Street, Obgnqjua, KI 22304

I'm testing Elasticsearch for functionality and performance at scale (adding 6 million property documents to simulate twice our current scale at work).

Besides basic search (taking full advantage of Elasticsearch's search engine features, etc.), we also need to be able to find a document by an exact substring of a field, because our customers embed encoded information in the name fields that they later use for searching.

To test this substring-search need, I used the following (looking for the above name value):

...
    'bool': {
        'should': [
            {'wildcard': {'name': '*zfugwbb Stre*'}}
        ]
    }

The bool, should, etc. are there because this is part of a larger query that includes other fields.

I expected it to return 0615 Zaxsnzfugwbb Street, Obgnqjua, KI 22304, but it didn't.

When I use just the *zfugwbb* portion, it returns the above document. So does searching for Zaxsnzfugwbb Street. This suggests to me that a wildcard value spanning multiple words (tokens) only matches when those words are whole tokens.

Is that a correct understanding? And, if so, is the only way to accomplish what I'm trying to do to use a regexp search instead of a wildcard? Also, is this all a really bad idea (from a performance perspective, etc.)? Finally, is it even possible with regexp (noting Regexp and case insensitive, which seems to preclude searching case-insensitively this way)?

I'm not sure how much time you've spent on Elasticsearch mappings yet. Have you applied any custom mappings to the index that you're querying? If not, Elasticsearch will apply a dynamic mapping, in which a field like name is analyzed. That means that by default the value of name is lowercased and broken up on (among other things) the whitespace and punctuation in the field, so you can search for the individual parts 0615, zaxsnzfugwbb, street, obgnqjua, ki and 22304.
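
If you want to see exactly what gets indexed, the _analyze API will show you the tokens for that value. Here is a rough sketch using the Python client; the es client object and the tx-general index name are assumptions based on the rest of this thread:

# Sketch: inspect how the default analyzer tokenizes the name value.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # adjust hosts/auth for your cluster

resp = es.indices.analyze(
    index='tx-general',
    body={
        'field': 'name',
        'text': '0615 Zaxsnzfugwbb Street, Obgnqjua, KI 22304'
    }
)

# Expect lowercased tokens like: 0615, zaxsnzfugwbb, street, obgnqjua, ki, 22304
print([t['token'] for t in resp['tokens']])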

However, if you use a wildcard query, the query string is not analyzed. Elasticsearch will try to find a document whose name field matches the exact pattern *zfugwbb Stre*, which it won't find, because for your document that value was chopped up into the tokens above.

Luckily, there is a solution here. 🙂 By default, when you index a string field, Elasticsearch creates a second field that you can query: in your case, name.keyword. That field contains the exact original string, without any text analysis applied to it. If you are going with the default mappings in Elasticsearch, you will want to query that field instead:

{
  "query": {
    "wildcard": {
      "name.keyword": "*zfugwbb Stre*"
    }
  }
}

Hi @abdon - firstly, thanks SO MUCH for taking the time to help me!

I'm very excited to see that this is even possible (querying the original field). However, when I tried to use that field in the wildcard search, I still got the same results.

As background, I'm not using any custom analyzers, and the name field is just marked as text in my mapping. Also, I'm on Elasticsearch 6.7.

Here are the mappings (expressed in a Python dict):

'properties': {
    'company_id': {'type': 'integer'},
    'id': {'type': 'integer'},
    'name': {'type': 'text'},
    'mls': {'type': 'text'},
    'buyer': {'type': 'text'},
    'seller': {'type': 'text'},
    'agents': {
        'type': 'nested',
        'properties': {
            'id': {'type': 'integer'},
            'email': {'type': 'text'},
            'first_name': {'type': 'text'},
            'last_name': {'type': 'text'},
            'name': {'type': 'text'}
        }
    }
}

I've included the entire query in both examples below for completeness, but of course only the wildcard portion on the name field is relevant.

Query (uses {'wildcard': {'name.keyword': '*zaxsnzfugwbb street*'}}):

{'query': {'bool': {'filter': [{'match': {'company_id': 1532}}, {'bool': {'should': [{'wildcard': {'name.keyword': '*zaxsnzfugwbb street*'}}, {'match_phrase': {'name': 'zaxsnzfugwbb street'}}, {'wildcard': {'mls': '*zaxsnzfugwbb street*'}}, {'match_phrase': {'mls': 'zaxsnzfugwbb street'}}, {'wildcard': {'buyer': '*zaxsnzfugwbb street*'}}, {'match_phrase': {'buyer': 'zaxsnzfugwbb street'}}, {'wildcard': {'seller': '*zaxsnzfugwbb street*'}}, {'match_phrase': {'seller': 'zaxsnzfugwbb street'}}, {'nested': {'path': 'agents', 'query': {'bool': {'should': [{'bool': {'filter': [{'wildcard': {'agents.first_name': 'zaxsnzfugwbb*'}}, {'wildcard': {'agents.last_name': '*street'}}]}}, {'bool': {'filter': [{'wildcard': {'agents.first_name': 'zaxsnzfugwbb*'}}, {'wildcard': {'agents.last_name': 'street*'}}]}}, {'bool': {'filter': [{'wildcard': {'agents.first_name': 'zaxsnzfugwbb*'}}, {'match_phrase': {'agents.last_name': 'street'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'zaxsnzfugwbb'}}, {'wildcard': {'agents.last_name': 'street*'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'zaxsnzfugwbb'}}, {'wildcard': {'agents.last_name': '*street'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'zaxsnzfugwbb'}}, {'match_phrase': {'agents.last_name': 'street'}}]}}]}}}}]}}]}}}

Results:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '76602400',
    '_index': 'tx-general',
    '_routing': '1532',
    '_score': 0.0,
    '_source': {'agents': [{'email': 'clzfvygaigl@gmail.com',
       'first_name': 'Agnebc',
       'id': 37451,
       'last_name': 'Wudtfmjy'},
      {'email': 'lkfbtywhilmmqiygizf@gmail.com',
       'first_name': 'Vqbmmb',
       'id': 39463,
       'last_name': 'Hpoiwiyt'}],
     'buyer': 'Vqfslk Fwdfxzzw',
     'company_id': 1532,
     'id': 76602400,
     'mls': 'RX-88972457/Adhilyabzbjq',
     'name': '0615 Zaxsnzfugwbb Street, Obgnqjua, KI 22304',
     'seller': 'Plczti Witdmgck'},
    '_type': 'tx'}],
  'max_score': 0.0,
  'total': 1},
 'timed_out': False,
 'took': 12}

Query (uses {'wildcard': {'name.keyword': '*xsnzfugwbb stre*'}}):

{'query': {'bool': {'filter': [{'match': {'company_id': 1532}}, {'bool': {'should': [{'wildcard': {'name.keyword': '*xsnzfugwbb stre*'}}, {'match_phrase': {'name': 'xsnzfugwbb stre'}}, {'wildcard': {'mls': '*xsnzfugwbb stre*'}}, {'match_phrase': {'mls': 'xsnzfugwbb stre'}}, {'wildcard': {'buyer': '*xsnzfugwbb stre*'}}, {'match_phrase': {'buyer': 'xsnzfugwbb stre'}}, {'wildcard': {'seller': '*xsnzfugwbb stre*'}}, {'match_phrase': {'seller': 'xsnzfugwbb stre'}}, {'nested': {'path': 'agents', 'query': {'bool': {'should': [{'bool': {'filter': [{'wildcard': {'agents.first_name': 'xsnzfugwbb*'}}, {'wildcard': {'agents.last_name': '*stre'}}]}}, {'bool': {'filter': [{'wildcard': {'agents.first_name': 'xsnzfugwbb*'}}, {'wildcard': {'agents.last_name': 'stre*'}}]}}, {'bool': {'filter': [{'wildcard': {'agents.first_name': 'xsnzfugwbb*'}}, {'match_phrase': {'agents.last_name': 'stre'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'xsnzfugwbb'}}, {'wildcard': {'agents.last_name': 'stre*'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'xsnzfugwbb'}}, {'wildcard': {'agents.last_name': '*stre'}}]}}, {'bool': {'filter': [{'match_phrase': {'agents.first_name': 'xsnzfugwbb'}}, {'match_phrase': {'agents.last_name': 'stre'}}]}}]}}}}]}}]}}}

Results:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 54}

The first example uses the whole words Zaxsnzfugwbb and Street.

The second example removes the first 2 letters from Zaxsnzfugwbb and the last 2 letters from Street, to simulate the scenario I described.

Alright, it looks like you do have a custom mapping, one in which the keyword multifields are missing. That means you've lost the ability to query the original strings.

By default, the mapping for the name field would look like this:

      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }

This would give you the capability to do wildcard queries on the name.keyword field.

You've got two options here. One is to recreate your index from scratch with a mapping like above for those string fields that you want to run wildcard queries against.
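
If you go with the first option, a rough sketch of it from the Python client could look like the following; the new index name tx-general-v2 is made up for illustration, and the mapping body is abbreviated to just the name field:

# Sketch of option 1: create a new index whose mapping includes the keyword
# multifields, then copy the existing documents over with the Reindex API.
from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(
    index='tx-general-v2',  # hypothetical new index name
    body={
        'mappings': {
            'tx': {
                'properties': {
                    'name': {
                        'type': 'text',
                        'fields': {'keyword': {'type': 'keyword'}}
                    }
                    # ...repeat for the other string fields you wildcard on
                }
            }
        }
    }
)

# Copying ~6M documents will take a while; you may prefer
# wait_for_completion=False and tracking the returned task instead.
es.reindex(
    body={'source': {'index': 'tx-general'}, 'dest': {'index': 'tx-general-v2'}},
    wait_for_completion=True
)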

Alternatively, you can also add the name.keyword multifield to the mapping for your existing index and run an update by query on that index to populate that new multifield.

Step 1, update your mapping:

PUT my_index/_mapping/_doc
{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

(replace my_index with your index name and _doc with your type name)

Step 2, update all documents to populate this new keyword multifield:

POST my_index/_update_by_query

(this is going to take a while, as Elasticsearch will basically be rewriting all your 6M documents)

Now, you should be able to query that name.keyword field.
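
For completeness, the same two steps driven from the Python client might look roughly like this; the index name tx-general and type tx are taken from your search results above, and everything else is an assumption:

# Sketch of option 2: add the keyword multifield to the existing mapping,
# then rewrite the documents in place with _update_by_query.
from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.put_mapping(
    index='tx-general',
    doc_type='tx',  # mapping type name from the results above
    body={
        'properties': {
            'name': {
                'type': 'text',
                'fields': {'keyword': {'type': 'keyword'}}
            }
        }
    }
)

# Rewrites every document so the new multifield gets populated.
es.update_by_query(index='tx-general', conflicts='proceed')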

Actually, I responded a bit too quickly. I missed that you need case-insensitive wildcard queries. Things get a little more complicated, as you will need to apply a normalizer that lowercases the strings to the name.keyword field.

The steps to update the index would be:

Step 1: add a normalizer to your index settings:

POST my_index/_close

PUT my_index/_settings
{
  "analysis": {
    "normalizer": {
      "my_normalizer": {
        "type": "custom",
        "filter": [
          "lowercase"
        ]
      }
    }
  }
}

POST my_index/_open

Step 2, update the mapping, but now including the normalizer:

PUT my_index/_mapping/_doc
{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

Step 3, run update by query:

POST my_index/_update_by_query
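
And again a rough Python-client equivalent of those three steps, in case that is easier to script; same caveats as before, treat it as a sketch rather than a drop-in:

# Sketch: add a lowercasing normalizer, attach it to the keyword multifield,
# then repopulate the multifield. The analysis settings can only be changed
# on a closed index, hence the close/open around the settings update.
from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.close(index='my_index')
es.indices.put_settings(
    index='my_index',
    body={
        'analysis': {
            'normalizer': {
                'my_normalizer': {'type': 'custom', 'filter': ['lowercase']}
            }
        }
    }
)
es.indices.open(index='my_index')

es.indices.put_mapping(
    index='my_index',
    doc_type='_doc',  # replace with your index and type names, as above
    body={
        'properties': {
            'name': {
                'type': 'text',
                'fields': {
                    'keyword': {'type': 'keyword', 'normalizer': 'my_normalizer'}
                }
            }
        }
    }
)

es.update_by_query(index='my_index', conflicts='proceed')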

Hey @abdon - thanks once again for all the help. Since the data was just test data, I deleted my index and created a new one with the following settings:

{
    'tx-general': {
        'settings': {
            'index': {
                'analysis': {
                    'normalizer': {
                        'lowerizer': {
                            'filter': ['lowercase'],
                            'type': 'custom'
                        }
                    }
                },
                'creation_date': '1562426665914',
                'number_of_replicas': '1',
                'number_of_shards': '6',
                'provided_name': 'tx-general',
                'uuid': '-pu93WMiTY-2hcWuJC6zsw',
                'version': {'created': '6070099'}
            }
        }
    }
}

My doc's mappings are as follows:

{
    'tx-general': {
        'mappings': {
            'tx': {
                'properties': {
                    'agents': {
                        'properties': {
                            'email': {
                                'fields': {
                                    'keyword': {
                                        'normalizer': 'lowerizer',
                                        'type': 'keyword'
                                    }
                                },
                                'type': 'text'
                            },
                            'first_name': {
                                'fields': {
                                    'keyword': {
                                        'normalizer': 'lowerizer',
                                        'type': 'keyword'
                                    }
                                },
                                'type': 'text'
                            },
                            'id': {
                                'type': 'integer'
                            },
                            'last_name': {
                                'fields': {
                                    'keyword': {
                                        'normalizer': 'lowerizer',
                                        'type': 'keyword'
                                    }
                                },
                                'type': 'text'
                            }
                        },
                        'type': 'nested'
                    },
                    'buyer': {
                        'fields': {
                            'keyword': {
                                'normalizer': 'lowerizer',
                                'type': 'keyword'
                            }
                        },
                        'type': 'text'
                    },
                    'company_id': {
                        'type': 'integer'
                    },
                    'id': {
                        'type': 'integer'
                    },
                    'mls': {
                        'fields': {
                            'keyword': {
                                'normalizer': 'lowerizer',
                                'type': 'keyword'
                            }
                        },
                        'type': 'text'
                    },
                    'name': {
                        'fields': {
                            'keyword': {
                                'normalizer': 'lowerizer',
                                'type': 'keyword'
                            }
                        },
                        'type': 'text'
                    },
                    'seller': {
                        'fields': {
                            'keyword': {
                                'normalizer': 'lowerizer',
                                'type': 'keyword'
                            }
                        },
                        'type': 'text'
                    }
                }
            }
        }
    }
}

(note: I used the custom lowerizer normalizer for all the text fields' keyword subfields, since at times we run similar exact-match searches on all of these fields)
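
In case it helps anyone else, a check along these lines (a lowercased wildcard against the normalized keyword subfield, spanning the word boundary) is roughly what now returns the document; the es client object here is just for illustration:

# Sketch: the previously failing query now matches, because name.keyword
# holds the whole (lowercased) address string.
from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(
    index='tx-general',
    body={
        'query': {
            'wildcard': {'name.keyword': '*xsnzfugwbb stre*'}
        }
    }
)
print(resp['hits']['total'])  # expect 1 for the test document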

And, everything is working perfectly! Thanks again for all your help. You have no idea how much of a help this is. We're up against some serious scale issues at work, and Elasticsearch is looking pretty promising as the solution. Without your help, I might have been spinning my wheels for days.
