Search by phone number

Ian_Eure · July 15, 2011, 6:09pm

I'm having a hard time matching documents with a phone number, and I'm not sure what's going on.

Some of my data has phone numbers separated with spaces:

"phone": "+1 415 931 1182",

Others have them with nothing but the numbers:

"phone": "4159311182",

This field is put into _all, which has the default filters and analyzers. When I search for the former, I get no results, but searches for the latter match just fine. This is the query I'm using:

"query": {
    "sort": [
        {
            "_score": "desc"
        }
    ], 
    "from": 0, 
    "fields": [
        "_source"
    ], 
    "explain": true, 
    "query": {
        "dis_max": {
            "queries": [
                {
                    "term": {
                        "_all": "415 931 1182"
                    }
                }
            ]
        }
    }, 
    "size": 25
},

It returns zero matches. I suspect that the tokens for the numbers are getting filtered out, but I see nothing in the docs to indicate that this would take place. I haven't been able to figure out how to see what tokens are actually generated for a document, so I have no way to confirm this hypothesis.

If anyone has any guidance, it would be very much appreciated.

Clinton_Gormley · July 15, 2011, 6:20pm

Hi Ian

Some of my data has phone numbers separated with spaces:

"phone": "+1 415 931 1182",

Others have them with nothing but the numbers:

"phone": "4159311182",

You don't mention what mapping the 'phone' field has. If you're relying
on the defaults, then it will depend on which version was mapped first.
If the first, then it will be string. If the second, it will be integer
(which is, by default, not analyzed).

Analysis would also remove the '+'

This field is put into _all, which has the default filters and
analyzers.

...which is analyzed by default

When I search for the former, I get no results, but searches for the
latter match just fine. This is the query I'm using:

"query": {
    "sort": [
        {
            "_score": "desc"
        }
    ], 
    "from": 0, 
    "fields": [
        "_source"
    ], 
    "explain": true, 
    "query": {
        "dis_max": {
            "queries": [
                {
                    "term": {
                        "_all": "415 931 1182"
                    }
                }
            ]
        }
    }, 
    "size": 25
},

Note: no point in using a dis_max query with only one query - doesn't
make sense.

It returns zero matches. I suspect that the tokens for the numbers are
getting filtered out, but I see nothing in the docs to indicate that
this would take place.

Mapping required before we can make any comment

I haven't been able to figure out how to see what tokens are actually
generated for a document, so I have no way to confirm this hypothesis.

If anyone has any guidance, it would be very much appreciated.

First, you need to decide how you want phone numbers to be searchable.
If a user enters "+1 415 931 1182" but you want to be able to find
"4159311182" then you're going to need to apply some rules to normalise
your phone numbers

clint

Ian_Eure · July 15, 2011, 6:34pm

On Jul 15, 2011, at 11:20 AM, Clinton Gormley wrote:

Hi Ian

Some of my data has phone numbers separated with spaces:

"phone": "+1 415 931 1182",

Others have them with nothing but the numbers:

"phone": "4159311182",

You don't mention what mapping the 'phone' field has. If you're relying
on the defaults, then it will depend on which version was mapped first.
If the first, then it will be string. If the second, it will be integer
(which is, by default, not analyzed).

u'phone': {u'type': u'string'},

Analysis would also remove the '+'

That's fine.

When I search for the former, I get no results, but searches for the
latter match just fine. This is the query I'm using:

"query": {
    "sort": [
        {
            "_score": "desc"
        }
    ],
    "from": 0,
    "fields": [
        "_source"
    ],
    "explain": true,
    "query": {
        "dis_max": {
            "queries": [
                {
                    "term": {
                        "_all": "415 931 1182"
                    }
                }
            ]
        }
    },
    "size": 25
},

Note: no point in using a dis_max query with only one query - doesn't
make sense.

Hm, okay. I'm porting stuff from Solr, and it definitely made a difference there — I guess ES doesn't break down the query into words to build the DisMax query like Solr does.

Elasticsearch Platform — Find real-time answers at scale | Elastic

Yes, I tried this — I believe you suggested it in IRC — but it only tells me how my input text is analyzed. I could use this to see how _all is broken down, but I'm not sure how that is constructed from my input document, and it does not appear to be possible to fetch the _all of a document. I updated my mapping to make _all stored, and included it in the `fields' portion of my query, but it is not returned.

If anyone has any guidance, it would be very much appreciated.

First, you need to decide how you want phone numbers to be searchable.
If a user enters "+1 415 931 1182" but you want to be able to find
"4159311182" then you're going to need to apply some rules to normalise
your phone numbers

The common case is that the number in the document is in "+1 aaa bbb cccc" format and this is also what the user enters.

Clinton_Gormley · July 15, 2011, 6:49pm

Note: no point in using a dis_max query with only one query - doesn't
make sense.

Hm, okay. I'm porting stuff from Solr, and it definitely made a
difference there - I guess ES doesn't break down the query into words
to build the DisMax query like Solr does.

No, the dis_max query in ES is used to score separate queries. It
chooses the best matching query, as opposed to bool query, which is
additive.
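
To make the difference concrete, here is a toy sketch in Python. It is not Elasticsearch code: the function names and the sample scores are made up for illustration, and real scoring also involves details such as tie_breaker and coordination factors that are ignored here.

```python
def dis_max_score(sub_scores):
    """Toy dis_max: the best-matching sub-query decides the score."""
    return max(sub_scores, default=0.0)


def bool_should_score(sub_scores):
    """Toy bool/should: the scores of all matching sub-queries add up."""
    return sum(sub_scores)


# Hypothetical scores of two sub-queries for the same document.
scores = [0.5, 1.25]
print(dis_max_score(scores))      # score of the best single clause
print(bool_should_score(scores))  # sum over both clauses
```

With a single sub-query both functions return the same value, which is why wrapping one query in dis_max changes nothing.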

Elasticsearch Platform — Find real-time answers at scale | Elastic

Yes, I tried this - I believe you suggested it in IRC - but it only
tells me how my input text is analyzed. I could use this to see how
_all is broken down, but I'm not sure how that is constructed from my
input document, and it does not appear to be possible to fetch the
_all of a document. I updated my mapping to make _all stored, and
included it in the `fields' portion of my query, but it is not
returned.

By default, the _all field uses the 'default' analyzer. So it takes all
of the fields, and runs the default analyzer on them. The analyzer that
is defined on the field doesn't affect how _all analyzes.

The common case is that the number in the document is in "+1 aaa bbb
cccc" format and this is also what the user enters.

with or without spaces? and how do you want to search on these? partial
matching? full matching?

If full matching, then I'd remove all spaces before and possibly prepend
a +1 to make all your phone numbers uniform, then use a term query.
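
A minimal sketch of that normalisation step in Python, assuming it runs in the application both before indexing and before searching. The helper names and the default country code are assumptions for illustration. The term query only works if the phone field is indexed without analysis, since analysis would drop the '+'.

```python
import re


def normalize_phone(raw, default_country_code="1"):
    """Reduce a phone number to '+<digits>' so all variants look the same.

    Assumption: a number without a leading '+' is a national number and
    gets default_country_code prepended.
    """
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits


def phone_term_query(raw):
    """Build a term query body against the un-analyzed phone field."""
    return {"query": {"term": {"phone": normalize_phone(raw)}}}


print(normalize_phone("+1 415 931 1182"))
print(normalize_phone("4159311182"))
```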

Alternatively, you may want to separate country code, regional code and
phone number into 3 separate entities.
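
As a rough illustration of that split, assuming every number arrives in the space-separated "+1 aaa bbb cccc" shape described in this thread (the function and key names are hypothetical):

```python
def split_phone(formatted):
    """Split '+1 415 931 1182' into country code, area code and local number.

    Assumption: the input is space-separated, with the country code first
    and the area code second; anything else raises ValueError.
    """
    parts = formatted.split()
    if len(parts) < 3 or not parts[0].startswith("+"):
        raise ValueError("expected '+<cc> <area> <number...>'")
    return {
        "country_code": parts[0].lstrip("+"),
        "area_code": parts[1],
        "number": "".join(parts[2:]),
    }


print(split_phone("+1 415 931 1182"))
```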

For partial matching, perhaps look at ngrams or edge ngrams
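
To show what edge ngrams would produce for a phone number, here is a small Python sketch. It only mimics the idea; it is not the Elasticsearch token filter, and the parameter names min_gram and max_gram are used here purely as function arguments.

```python
def edge_ngrams(token, min_gram=3, max_gram=None):
    """Return the leading substrings of token, shortest first."""
    if max_gram is None:
        max_gram = len(token)
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


number = "4159311182"
print(edge_ngrams(number, 3, 5))

# Reversing the digits before taking edge ngrams yields suffixes instead,
# which is one way to support "last N digits" style matching.
print([g[::-1] for g in edge_ngrams(number[::-1], 4, 7)])
```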

clint

Ian_Eure · July 15, 2011, 9:06pm

On Jul 15, 2011, at 11:49 AM, Clinton Gormley wrote:

Note: no point in using a dis_max query with only one query - doesn't
make sense.

Hm, okay. I'm porting stuff from Solr, and it definitely made a
difference there — I guess ES doesn't break down the query into words
to build the DisMax query like Solr does.

No, the dis_max query in ES is used to score separate queries. It
chooses the best matching query, as opposed to bool query, which is
additive.

Elasticsearch Platform — Find real-time answers at scale | Elastic

Yes, I tried this — I believe you suggested it in IRC — but it only
tells me how my input text is analyzed. I could use this to see how
_all is broken down, but I'm not sure how that is constructed from my
input document, and it does not appear to be possible to fetch the
_all of a document. I updated my mapping to make _all stored, and
included it in the `fields' portion of my query, but it is not
returned.

By default, the _all field uses the 'default' analyzer. So it takes all
of the fields, and runs the default analyzer on them. The analyzer that
is defined on the field doesn't affect how _all analyzes.

I understand this, but I don't think what I said has anything to do with it. My issue is that I cannot see the contents of _all, either before or after they've been analyzed, so I don't know what information is or is not in there.

The common case is that the number in the document is in "+1 aaa bbb
cccc" format and this is also what the user enters.

with or without spaces? and how do you want to search on these? partial
matching? full matching?

With spaces, exactly like I wrote. It probably makes more sense to store it as just the number and strip any non-numeric characters from the user's input, though. I'd want to match on the full number with country & area code, as well as on a plain prefix/suffix; so +10005551212, +1 000 555 1212, 000 555 1212, 555 1212 should all match "+10005551212".
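
The matching rule described here can be pinned down with a small Python check. This is only a sketch of the intended behaviour, done in application code rather than in Elasticsearch, and the helper names are made up.

```python
import re


def digits_only(text):
    """Drop everything except the digits."""
    return re.sub(r"\D", "", text)


def matches_stored(user_input, stored):
    """True if the typed digits are a prefix or suffix of the stored digits."""
    query = digits_only(user_input)
    target = digits_only(stored)
    return bool(query) and (target.startswith(query) or target.endswith(query))


stored = "+10005551212"
for typed in ["+10005551212", "+1 000 555 1212", "000 555 1212", "555 1212"]:
    print(typed, matches_stored(typed, stored))
```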

If full matching, then I'd remove all spaces before and possibly prepend
a +1 to make all your phone numbers uniform, then use a term query.

I have documents with international numbers, too. I think a substring match is what I want.

kimchy · July 15, 2011, 10:04pm

Note, when doing term query, there is no analysis happening on the text provided to it. Use the text query for it.
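
A toy illustration in Python of why this matters for the query in this thread. The toy_analyze function is only a rough stand-in for the default analyzer (an assumption, not the real implementation); the two request bodies at the end use the term and text query names mentioned above.

```python
import re


def toy_analyze(text):
    """Very rough stand-in for the default analyzer: split into word tokens."""
    return re.findall(r"\w+", text.lower())


# Tokens that would end up in _all for the stored phone number.
indexed_tokens = set(toy_analyze("+1 415 931 1182"))
user_input = "415 931 1182"

# term query: the input is looked up as one exact token, spaces included.
term_matches = user_input in indexed_tokens

# text query: the input is analyzed first, then its tokens are looked up
# (any single matching token is enough in this simplified version).
text_matches = any(tok in indexed_tokens for tok in toy_analyze(user_input))

print(term_matches, text_matches)

# Request bodies for the two variants, as Python dicts.
term_body = {"query": {"term": {"_all": user_input}}}
text_body = {"query": {"text": {"_all": user_input}}}
```

The same reasoning explains why "4159311182" works with a term query: it is indexed as a single token, so the exact lookup finds it.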

kimchy · July 15, 2011, 11:27pm

I wonder if it makes sense to add a phone number type, which will normalize phone numbers to a consistent format. Maybe even something that can extract phone numbers from text? Possibly either as a type, or as an analysis step token filter....

Ian_Eure · July 15, 2011, 11:32pm

On Jul 15, 2011, at 4:27 PM, Shay Banon wrote:

I wonder if it makes sense to add a phone number type, which will normalize phone numbers to a consistent format. Maybe even something that can extract phone numbers from text? Possibly either as a type, or as an analysis step token filter....

That would be pretty neat. I'm using Google's libphonenumber for parsing, but it depends on knowing what country the number is from, since local formats vary.

kimchy · July 15, 2011, 11:33pm

Yeah, that's what I looked at as well, and I wondered if it makes sense to embed it in elasticsearch in one way or another...

Clinton_Gormley · July 16, 2011, 5:13am

Hi Ian

By default, the _all field uses the 'default' analyzer. So it takes all
of the fields, and runs the default analyzer on them. The analyzer that
is defined on the field doesn't affect how _all analyzes.

I understand this, but I don't think what I said has anything to do
with it. My issue is that I cannot see the contents of _all, either
before or after they've been analyzed, so I don't know what
information is or is not in there.

One thing you could do here is a terms facet on the _all field - that
way you can see what terms are stored there:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
    "facets" : {
        "all_field" : {
            "terms" : {
                "size" : 20,
                "field" : "_all"
            }
        }
    },
    "size" : 0
}
'

clint

karmi · July 17, 2011, 6:19am

Is there any reason for not searching in the phone field, but in the
_all field?

Ian_Eure · July 20, 2011, 5:57pm

On Jul 16, 2011, at 11:19 PM, Karel Minarik wrote:

Is there any reason for not searching in the phone field, but in the
_all field?

A phone number is one of many different things that a user might enter to search. I probably need a heuristic to change the query behavior, e.g. if it's all numbers with an optional plus prefix, only query the phone number field.
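
A minimal sketch of such a heuristic in Python. The regular expression also allows spaces between digit groups, which is an assumption on top of what is described above; 'phone' and '_all' are the field names from this thread, and the function name is made up.

```python
import re

# Assumption: digits with an optional leading '+', spaces allowed between groups.
PHONE_LIKE = re.compile(r"^\+?\d[\d ]*$")


def fields_for(user_input):
    """Pick the fields to search based on what the input looks like."""
    if PHONE_LIKE.match(user_input.strip()):
        return ["phone"]
    return ["_all"]


print(fields_for("+1 415 931 1182"))
print(fields_for("coffee shop"))
```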

karmi · July 21, 2011, 8:51am

A phone number is one of many different things that a user might enter to search. I probably need a heuristic to change the query behavior, e.g. if it's all numbers with an optional plus prefix, only query the phone number field.

Sure. My impression, which may be wrong, is that you're making this more
complicated than it needs to be and not using the features Elasticsearch
already gives you. Why not search multiple fields with the same
user-entered query? I.e. q=phone:<QUERY> OR name:<QUERY> OR
whatever:<QUERY>. (Of course, you can use a boolean query or whatever DSL
syntax works best for you here.) This way, the user-entered query is
analyzed in the same way as the field in question, so phone numbers are
normalized, names are lowercased, whatever is stemmed, etc.
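
One possible shape for that request, sketched as a Python helper that builds the body. It uses a bool query with one text query per field; the helper name is made up, and 'phone' and 'name' are just the example field names from above.

```python
def multi_field_query(user_input, fields):
    """One text query per field, combined so that any field may match."""
    should = [{"text": {field: user_input}} for field in fields]
    return {"query": {"bool": {"should": should}}}


# 'phone' and 'name' are example field names taken from the discussion.
body = multi_field_query("415 931 1182", ["phone", "name"])
print(body)
```

Because each clause targets its own field, the input is analyzed with that field's analyzer rather than with the one used for _all.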

drewdahlke · August 7, 2015, 10:26am

Ancient post, but still relevant. I took a stab at writing a phone/SIP analyzer plugin using Google's libphonenumber, and it's working well for us. Figured I'd share: https://github.com/MyPureCloud/elasticsearch-phone