Help needed understanding analyzer behavior


(Neko Escondido) #1

Hello community,

I'm having trouble understanding how an analyzer should work. The result is
different from what I expect. :frowning:

I have created a custom analyzer to index phone numbers, as shown below:

    "analysis" : {
        "analyzer" : {
            "phone" : {
                "type" : "custom",
                "tokenizer" : "phone_tokenizer",
                "filter" : [ "phone_filter", "unique" ]
            }
        },
        "tokenizer" : {
            "phone_tokenizer" : {
                "type" : "pattern",
                "pattern" : "\\s*[a-zA-Z]+\\s*"
            }
        },
        "filter" : {
            "phone_filter" : {
                "type" : "word_delimiter",
                "preserve_original" : true,
                "generate_number_parts" : true,
                "catenate_numbers" : true
            }
        }
    }

The intention is to match query input such as:
1112223333, 111.222.3333, 111-222-3333, 111 222 3333, (111)2223333,
1-(111)-222-3333, etc.
with records containing phone numbers such as:
111.222.3333, 111-222-3333, 111 222 3333, (111)2223333,
1-(111)-222-3333, etc.

So with the search input (111)2223333 and the query type "matchPhraseQuery", I
thought the query would return the records with phone numbers such as
111.222.3333, 111-222-3333, etc., because the input (111)2223333 would be
analyzed into 1112223333, 111, and 2223333.
Given that I have specified "catenate_numbers" in the filter for my "phone"
analyzer, I would expect numbers that meet the following condition to be
matched:
numbers that are indexed as ( 111 AND 2223333 ) OR 1112223333.
But the result is no match.

Is my understanding incorrect? With the search input (111)2223333 using
matchPhraseQuery, I thought it would match all numbers that have 1112223333
as the concatenated value, but it seems to match only numbers whose
number parts are 111 and 2223333...
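
To make the expected tokens concrete, here is a rough Python sketch of which token strings the "phone" analyzer might produce for a query value and an indexed value. It is a simplified model and an assumption, not Lucene's word_delimiter implementation: the helper name `phone_tokens` is made up, it only keeps the original, the digit runs, and the catenated digits, and it ignores token positions, which a phrase query also depends on.

```python
import re


def phone_tokens(text):
    """Approximate the token strings emitted for one phone value.

    Simplified model (an assumption, not Lucene's code): keep the original
    value, split out each run of digits, and add all digits catenated.
    Token positions are not modelled.
    """
    parts = re.findall(r"\d+", text)
    candidates = [text] + parts + ["".join(parts)]
    tokens = []
    for token in candidates:
        # Mimic the "unique" filter by dropping repeated token strings.
        if token and token not in tokens:
            tokens.append(token)
    return tokens


query_tokens = phone_tokens("(111)2223333")
indexed_tokens = phone_tokens("111-222-3333")

print("query:  ", query_tokens)
print("indexed:", indexed_tokens)
print("shared: ", sorted(set(query_tokens) & set(indexed_tokens)))
```

In this model the query side contains 2223333, but the value 111-222-3333 is split into 111, 222 and 3333, so 2223333 never appears on the indexed side. That would be consistent with a phrase query finding no match even though 1112223333 is shared; the real token stream, including positions, is best confirmed with the Analyze API.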

Your feedback/help/input is greatly appreciated!!
Best regards

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b0658b33-2efb-495a-8090-7cc12806a253%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2

It's probably easier to use a char filter to remove all non-digit characters.
On the other hand, if you want to normalize numbers that only sometimes
contain an area code and country code into a single form, you'll probably want
to do that outside of Elasticsearch or with a plugin. That gets difficult when
you need to handle non-NANPA numbers.
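
As a minimal sketch of the char filter idea, the Python below builds analysis settings that use the `pattern_replace` char filter with the `keyword` tokenizer, and mimics the filter locally with a regular expression. The names `digits_only`, `phone_digits`, `strip_non_digits` and `normalize_nanp` are hypothetical, and the leading-1 rule is only an illustration of normalizing outside Elasticsearch for NANPA-style numbers.

```python
import json
import re

# Hypothetical settings: "digits_only" and "phone_digits" are made-up names.
settings = {
    "analysis": {
        "char_filter": {
            "digits_only": {
                "type": "pattern_replace",
                "pattern": "[^0-9]",
                "replacement": "",
            }
        },
        "analyzer": {
            "phone_digits": {
                "type": "custom",
                "char_filter": ["digits_only"],
                "tokenizer": "keyword",
            }
        },
    }
}


def strip_non_digits(text):
    """Local stand-in for the char filter: remove every non-digit character."""
    return re.sub(r"[^0-9]", "", text)


def normalize_nanp(text):
    """Assumed client-side step: drop a leading country code 1 from 11 digits."""
    digits = strip_non_digits(text)
    if len(digits) == 11 and digits.startswith("1"):
        return digits[1:]
    return digits


print(json.dumps(settings, indent=2))
for raw in ["111.222.3333", "(111)2223333", "1-(111)-222-3333"]:
    print(raw, "->", strip_non_digits(raw), "/", normalize_nanp(raw))
```

Stripping non-digits alone leaves 1-(111)-222-3333 as 11112223333, which differs from 1112223333. That is the country-code case that needs extra normalization, and the simple rule here would not be safe for non-NANPA numbers.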


(Neko Escondido) #3

Hi Nikolas,
Thank you very much for your feedback. I was hoping to be able to search
against the phone number field in normalized, original, and number-parts
formats. If I modify the input into the normalized format, then searching with
the original or number-parts format will not return the desired result...
Or am I misunderstanding your suggestion?
Multi-field indexing is an option, but I would like to avoid it if possible (so
that the client executing the query does not have to know all the possible
field names a phone number field might be mapped to)...
Once again, thank you very much for your feedback. Does what I described above
sound possible using a char filter or plugin?

On Wednesday, July 30, 2014 8:28:35 PM UTC-7, Nikolas Everett wrote:

It's probably easier to use a char filter to remove all non-digit characters.
On the other hand, if you want to normalize numbers that only sometimes
contain an area code and country code into a single form, you'll probably want
to do that outside of Elasticsearch or with a plugin. That gets difficult when
you need to handle non-NANPA numbers.


(sina.tamanna) #4

When I develop custom analyzers, I use the Analyze API to test them and
understand the tokens that will be indexed. Take a look
at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
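
For example, a request to the Analyze API could be prepared from Python roughly as follows. This is a sketch under assumptions: the host `http://localhost:9200`, the index name `my_index`, and the helper `build_analyze_request` are made up, and the JSON body form shown here is what newer Elasticsearch releases expect, while older releases also accepted the analyzer as a query-string parameter. The request is only built, not sent.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="phone", index="my_index",
                          host="http://localhost:9200"):
    """Build (but do not send) an Analyze API request for one piece of text.

    The host and index defaults are placeholders, not values from the thread.
    """
    url = f"{host}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("(111)2223333")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually run it against a live cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.dumps(json.load(response), indent=2))
```

The response lists each token with its position, so comparing the output for a query value and an indexed value shows which tokens a phrase query has to line up.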


