Search phone number

nik9000 · October 22, 2015, 2:16pm

I think I'd try the pattern_capture token filter to extract the parts you want to match. In the case you mention you'd want to strip the leading 7. I don't know Russia's number resolution rules, but for the North American Numbering Plan (NANP) I'd use something like

"phone_number" : {
  "type" : "pattern_capture",
  "preserve_original" : true,
  "patterns" : [
    "1(\\d{3}(\\d+))"
  ]
}

You put the phone_number filter into an analyzer, apply that as the index_analyzer, and just use a keyword analyzer that strips +-() at search time. Or strip in your application. The index_analyzer here would index a number like 19195557321 as 19195557321, 9195557321, and 5557321, which matches the way phone numbers are resolved in NANP. A user searching for 5557321 would get all the numbers ending in 5557321: 19195557321, 13215557321, etc.
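To make the expansion concrete, here is a rough Python stand-in for what the filter does to one token. It is only an approximation of pattern_capture (the function name and the de-duplication are my own simplifications, not Elasticsearch behaviour), but it shows which terms the pattern produces:

```python
import re

# Same pattern as in the filter definition, written as a Python raw string.
NANP_PATTERN = re.compile(r"1(\d{3}(\d+))")

def pattern_capture(token, patterns=(NANP_PATTERN,), preserve_original=True):
    """Approximate the filter: emit the original token plus every capture group."""
    terms = [token] if preserve_original else []
    for pattern in patterns:
        for match in pattern.finditer(token):
            for group in match.groups():
                # Skipping repeats is a simplification made for this sketch.
                if group and group not in terms:
                    terms.append(group)
    return terms

terms = pattern_capture("19195557321")
print(terms)
# A search for the 7-digit local number hits one of the indexed terms.
print("5557321" in terms)
```

Note that the pattern is not anchored, so a token that lacks the leading 1 but has a 1 somewhere inside it would also produce captures. Worth checking against your own data.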

I'd also strip all the +-() stuff from the numbers before indexing them in elasticsearch. You don't want them in the _source because they don't add anything.
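If you do the stripping in your application, it can be as small as this. A minimal sketch: strip_formatting is a made-up name, and it drops every non-digit (spaces and dots too), which goes a little further than just +-():

```python
import re

def strip_formatting(raw):
    """Keep only the digits of a phone number as typed by a user."""
    # Assumption: anything that is not a digit is formatting noise.
    return re.sub(r"\D", "", raw)

print(strip_formatting("+1 (919) 555-7321"))
```

You'd run the same function on the user's query before searching, so both sides are compared as bare digits.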

I once worked for a phone company so I've thought a lot about phone numbers.

BTW - this is a tradeoff. When you get new resolution rules you have to change the mapping and reindex the whole index. If you moved the term expansion that the analyzer is doing outside of Elasticsearch, then you could be more surgical when the patterns change: you'd only need to update the documents whose expansions are affected. I'd suggest doing something like that if you had to cover the whole world. So you'd index

{
  "phone_number": {
    "raw": "19195557321",
    "expansions": ["19195557321", "9195557321", "5557321"]
  }
}

and you'd search on phone_number.expansions.
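Roughly, the application side could look like the sketch below. The rule table and both function names are hypothetical, the only rule shown is the NANP one from above, and the term query assumes phone_number.expansions is mapped so that its values are indexed as-is (not analyzed):

```python
import re

# Hypothetical rule table: one pattern per numbering plan, kept in the
# application so it can change without touching the mapping.
EXPANSION_RULES = {
    "nanp": re.compile(r"1(\d{3}(\d+))"),
}

def build_phone_document(raw, plan="nanp"):
    """Build the document body with the expansions computed up front."""
    expansions = [raw]
    # fullmatch anchors the pattern to the whole number.
    match = EXPANSION_RULES[plan].fullmatch(raw)
    if match:
        expansions.extend(match.groups())
    return {"phone_number": {"raw": raw, "expansions": expansions}}

def build_search_body(digits):
    """Exact-match query against the precomputed expansions."""
    return {"query": {"term": {"phone_number.expansions": digits}}}

print(build_phone_document("19195557321"))
print(build_search_body("5557321"))
```

When a rule changes you recompute the expansions only for numbers in that plan and update those documents, instead of reindexing everything.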

Which solution you take is all a matter of how big a deal this is for you. @Mark_Harwood's solution is perfectly reasonable for lots of applications. It's certainly simpler.