How to conditionally prepend text to E.164 format phone field


(Mike Sparr) #1

I have an index with 650,000 contacts and recently we standardized to store them in E.164 format (i.e. +15555555555). There are still old records 500K+ that are in other formats and I'd like to clean the data and standardize them all to E.164. It's not clear the best approach...

What is the best way to update all phone fields that are 10 chars with E.164 + 1 format if they are not already formatted?

I performed _update_by_query first to strip all non-numeric values

POST contacts-test/_doc/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "if ( ctx._source.containsKey(\"phone_default\") ) { ctx._source.phone_default.value = /[^0-9]/.matcher(ctx._source.phone_default.value).replaceAll('') } else { ctx.op = \"noop\" }"
  }
}

I'm struggling, however, to find all fields that already have a +1 and omit them from an update, and how best to prepend the rest with +1. Furthermore, there may be some numbers that slipped in long ago that are not 10 characters in length, so I'd like to make sure they are 10 characters, add the +1, and perhaps set values < 10 chars to null.

I've configured the index with the following:

	"index_patterns": "contacts*",
	"settings": {
		"index": {
			"number_of_shards": 3,
			"number_of_replicas": 0
		},
		"analysis": {
			"char_filter": {
				"digits_only": {
					"type": "pattern_replace",
					"pattern": "[^\\d]"
				}
			},
			"filter": {
				"autocomplete_filter": {
					"type": "edge_ngram",
					"min_gram": 2,
					"max_gram": 20
				},
				"us_phone_number": {
					"type": "pattern_capture",
					"preserve_original": true,
					"patterns": [
            "1?(1)(\\d*)",
            "+1(\\d*)"
					]
				},
				"ten_digits_min": {
					"type": "length",
					"min": 10
				},
				"not_empty": {
					"type": "length",
					"min": 1
				}
			},
			"tokenizer": {
				"whitespace": {
					"type": "whitespace"
				}
			},
			"normalizer": {
				"lower_case": {
					"type": "custom",
					"char_filter": [],
					"filter": ["lowercase", "asciifolding"]
				}
			},
			"analyzer": {
				"autocomplete": {
					"type": "custom",
					"tokenizer": "standard",
					"filter": [
						"lowercase",
						"autocomplete_filter"
					]
				},
				"fingerprint_en": {
					"type": "fingerprint",
					"stopwords": "_english_"
				},
				"urls-links-emails": {
					"type": "custom",
					"tokenizer": "uax_url_email"
				},
				"address": {
					"type": "custom",
					"tokenizer": "whitespace",
					"filter": ["trim", "lowercase"]
				},
				"phone_number": {
					"char_filter": "digits_only",
					"tokenizer": "keyword",
					"filter": [
						"us_phone_number",
						"ten_digits_min"
					]
				},
				"phone_number_search": {
					"char_filter": "digits_only",
					"tokenizer": "keyword",
					"filter": [
						"not_empty"
					]
				}
			}
		}
	}

And mapped the field as such:

	"phone_default": {
		"properties": {
			"value": {
				"type": "text",
				"analyzer": "phone_number",
				"search_analyzer": "phone_number_search",
				"fields": {
					"raw": {
						"type": "keyword"
					}
				}
			}
		}
	}

(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.