Issue with using word delimiter filter

Amit_Soni · October 31, 2013, 7:10pm

Hi all - I have a phone number field and I am trying to use the word_delimiter
filter in order to break it up into tokens, preserve the original entry and
concatenate all the numbers in the entry. I have the following settings:

"phoneAnalyzer" : {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
        "word_delimiter_for_phone"
    ]
}

"filter": {
    "word_delimiter_for_phone": {
        "type": "word_delimiter",
        "catenate_numbers": true,
        "preserve_original": true
    }
}

Using this, when I run it on input "345 678-1234" I get the following:

{
  "tokens" : [ {
    "token" : "345",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "",
    "position" : 1
  }, {
    "token" : "678",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  }, {
    "token" : "1234",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "",
    "position" : 3
  } ]
}

Question: Shouldn't this also have generated a concatenated token of the
form 3456781234?

Anything I am missing here?

-Amit.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

sina_tamanna · November 1, 2013, 7:42am

Analysis starts with the tokenizer, which in your case is "standard".
The input "345 678-1234" is therefore tokenized into "345", "678", and
"1234", and only then are the filters applied. By that point the
word_delimiter filter sees three separate tokens, each with no delimiter
inside it, so there is nothing left to concatenate. One way to get both the
original and the concatenated input is to use the "keyword" tokenizer.
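
To make the ordering concrete, here is a rough Python sketch of the idea. It is a toy model, not Lucene's or Elasticsearch's actual code: standard_like_tokenize, keyword_tokenize, toy_word_delimiter and analyze are hypothetical helpers invented for this illustration, and the toy filter only handles digit runs. The token order it prints is not meant to match what the real filter emits.

```python
import re

def standard_like_tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+", text)

def keyword_tokenize(text):
    """The keyword tokenizer emits the whole input as one token."""
    return [text]

def toy_word_delimiter(token, generate_number_parts=True,
                       catenate_numbers=True, preserve_original=True):
    """Toy word_delimiter for ONE token (assumption: digits-only sub-parts)."""
    parts = re.findall(r"[0-9]+", token)
    if parts == [token]:
        # No delimiter inside the token: it passes through unchanged.
        return [token]
    out = []
    if preserve_original:
        out.append(token)
    if generate_number_parts:
        out.extend(parts)
    if catenate_numbers and len(parts) > 1:
        out.append("".join(parts))
    return out

def analyze(text, tokenizer):
    """Tokenizer runs first; the filter then sees each token on its own."""
    tokens = []
    for token in tokenizer(text):
        tokens.extend(toy_word_delimiter(token))
    return tokens

print(analyze("345 678-1234", standard_like_tokenize))
# ['345', '678', '1234']
print(analyze("345 678-1234", keyword_tokenize))
# ['345 678-1234', '345', '678', '1234', '3456781234']
```

With the standard-like tokenizer the filter never sees the delimiters, so no concatenated token can appear. With the keyword tokenizer the whole input reaches the filter intact.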

dadoonet · November 1, 2013, 8:05am

Or disable analysis for this field.

HTH

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

dadoonet · November 1, 2013, 8:07am

Sorry, forget my answer. It is useless here.

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Amit_Soni · April 22, 2014, 3:46am

Hi everyone - I have changed the mapping so that it now looks like the one
below. However, for a given input, say 123-456-8989, the generated tokens are:

a) 123-456-8989 b) 123 c) 456 d) 8989 e) 1234568989

I was expecting just two tokens: a) 123-456-8989 b) 1234568989

Would you know what might be going wrong here?

"default_index": {
    "tokenizer": "keyword",
    "filter": [
        "lowercase"
    ]
},

"phoneAnalyzer": {
    "type": "custom",
    "tokenizer": "keyword",
    "filter": [
        "word_delimiter_for_phone"
    ]
},

"word_delimiter_for_phone": {
    "type": "word_delimiter",
    "catenate_all": true,
    "generate_number_parts ": false,
    "split_on_case_change": false,
    "generate_word_parts": false,
    "split_on_numerics": false,
    "preserve_original": true
},

-Amit.
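
A note on the config as quoted above, offered as a guess rather than a confirmed diagnosis: the key is written as "generate_number_parts " with a trailing space. If the filter looks its options up by exact name and ignores keys it does not recognise (an assumption about how this version behaves), that entry would never match, and generate_number_parts would keep its documented default of true. That would be consistent with the extra tokens 123, 456 and 8989. The Python sketch below only illustrates that lookup behaviour. DEFAULTS, effective_settings and suspicious_keys are hypothetical names for this example, and the defaults listed are taken from the word_delimiter documentation as I understand it.

```python
# Documented word_delimiter defaults for the options used in the post
# (listed here as an assumption, not read from Elasticsearch itself).
DEFAULTS = {
    "generate_word_parts": True,
    "generate_number_parts": True,
    "catenate_all": False,
    "split_on_case_change": True,
    "split_on_numerics": True,
    "preserve_original": False,
}

# The filter settings as quoted in the post, trailing space included.
AS_POSTED = {
    "type": "word_delimiter",
    "catenate_all": True,
    "generate_number_parts ": False,
    "split_on_case_change": False,
    "generate_word_parts": False,
    "split_on_numerics": False,
    "preserve_original": True,
}

def effective_settings(settings):
    """Resolve each known option by exact key name, else use its default."""
    return {name: settings.get(name, default)
            for name, default in DEFAULTS.items()}

def suspicious_keys(settings):
    """Keys that carry leading or trailing whitespace."""
    return [key for key in settings if key != key.strip()]

print(suspicious_keys(AS_POSTED))
# ['generate_number_parts ']
print(effective_settings(AS_POSTED)["generate_number_parts"])
# True  (the misspelled key is never consulted)

# Stripping whitespace from the keys makes the option take effect.
cleaned = {key.strip(): value for key, value in AS_POSTED.items()}
print(effective_settings(cleaned)["generate_number_parts"])
# False
```

If this is the cause, removing the trailing space from the key and re-running the analyze request should show whether only the original and the concatenated token remain.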

