How to conditionally prepend text to E.164 format phone field

mikesparr · September 28, 2018, 5:08pm

I have an index with 650,000 contacts and recently we standardized to store them in E.164 format (i.e. +15555555555). There are still old records 500K+ that are in other formats and I'd like to clean the data and standardize them all to E.164. It's not clear the best approach...

What is the best way to update all phone fields that are 10 chars with E.164 + 1 format if they are not already formatted?

I performed _update_by_query first to strip all non-numeric values

POST contacts-test/_doc/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "if ( ctx._source.containsKey(\"phone_default\") ) { ctx._source.phone_default.value = /[^0-9]/.matcher(ctx._source.phone_default.value).replaceAll('') } else { ctx.op = \"noop\" }"
  }
}

I'm struggling, however, to find all fields that already have a +1 and omit them from an update, and how best to prepend the rest with +1. Furthermore, there may be some numbers that slipped in long ago that are not 10 characters in length, so I'd like to make sure they are 10 characters, add the +1, and perhaps set values < 10 chars to null.

I've configured the index with the following:

	"index_patterns": "contacts*",
	"settings": {
		"index": {
			"number_of_shards": 3,
			"number_of_replicas": 0
		},
		"analysis": {
			"char_filter": {
				"digits_only": {
					"type": "pattern_replace",
					"pattern": "[^\\d]"
				}
			},
			"filter": {
				"autocomplete_filter": {
					"type": "edge_ngram",
					"min_gram": 2,
					"max_gram": 20
				},
				"us_phone_number": {
					"type": "pattern_capture",
					"preserve_original": true,
					"patterns": [
            "1?(1)(\\d*)",
            "+1(\\d*)"
					]
				},
				"ten_digits_min": {
					"type": "length",
					"min": 10
				},
				"not_empty": {
					"type": "length",
					"min": 1
				}
			},
			"tokenizer": {
				"whitespace": {
					"type": "whitespace"
				}
			},
			"normalizer": {
				"lower_case": {
					"type": "custom",
					"char_filter": [],
					"filter": ["lowercase", "asciifolding"]
				}
			},
			"analyzer": {
				"autocomplete": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"autocomplete_filter"
					]
				},
				"fingerprint_en": {
					"type": "fingerprint",
					"stopwords": "_english_"
				},
				"urls-links-emails": {
					"type": "custom",
					"tokenizer": "uax_url_email"
				},
				"address": {
					"type": "custom",
					"tokenizer": "whitespace",
					"filter": ["trim", "lowercase"]
				},
				"phone_number": {
					"char_filter": "digits_only",
					"tokenizer": "keyword",
					"filter": [
						"us_phone_number",
						"ten_digits_min"
					]
				},
				"phone_number_search": {
					"char_filter": "digits_only",
					"tokenizer": "keyword",
					"filter": [
						"not_empty"
					]
				}
			}
		}
	}

And mapped the field as such:

	"phone_default": {
		"properties": {
			"value": {
				"type": "text",
				"analyzer": "phone_number",
				"search_analyzer": "phone_number_search",
				"fields": {
					"raw": {
						"type": "keyword"
					}
				}
			}
		}
	}

system · October 26, 2018, 5:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Update 2 million record with a correct format of country prefix in phone number field Elasticsearch	3	467	July 27, 2020
Adding tokenizer/patternReplace on phoneNumber using NEST client Elasticsearch	1	326	September 21, 2020
Phone # search query Elasticsearch	2	407	July 5, 2017
Does Elastic Search support Mobile Number variation (ex: US Format that has special character) Elasticsearch	3	378	September 21, 2021
Search phone number Elasticsearch	6	9838	July 5, 2017

How to conditionally prepend text to E.164 format phone field

What is the best way to update all phone fields that are 10 chars with E.164 + 1 format if they are not already formatted?

Related topics