Special characters not indexed and hence not searchable

I have seen many threads discussing this. I have crossed a few hurdles, but the
last one is still bothering me. I have email addresses, ":", and "/" in my data,
which need to be indexed and searched. I am now able to search the email address
test@myco.com, but I still cannot get the following characters indexed:
":", "/", and "-". Any help is greatly appreciated. Is it an issue with my
index_analyzer or search_analyzer?

Thanks in Advance
-Praveen

Here is my mapping:

ESINDEX = {
    "number_of_shards": 1,
    "analysis": {
        "filter": {
            "mynGram" : {
                "type" : "nGram",
                "min_gram": 1,
                "max_gram": 50
            }
        },
        "analyzer": {
            "a1" : {
                "type" : "custom",
                "tokenizer" : "uax_url_email",
                "filter" : ["mynGram"]
            }
        }
    }
}

ESMAPPINGS = {
    "index_analyzer" : "a1",
    "search_analyzer" : "whitespace",
    "properties" : {
        u'test_field1' : {
            'index' : 'not_analyzed',
            'type' : u'string',
            'store' : 'yes'
        },
        u'testfield2' : {
            'index' : 'not_analyzed',
            'type' : u'string',
            'store' : 'yes'
        },
        u'email' : {
            'index' : 'not_analyzed',
            'type' : u'string',
            'store' : 'yes'
        },
        :::::::::::
    }
}

Do you mean you want to have ":" and "/" searchable?
If so, you may have to use a custom tokenizer and specify a pattern to
tokenize on.

This will include all Unicode characters plus special characters such
as "/" and ":".

'tokenizer' : {
    'email_tokenizer' : {
        'type' : 'pattern',
        'pattern' : "[@*\/*:*\.*\\w\\p{L}]+"
    }
}
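
One caveat worth flagging here: by default the Lucene pattern tokenizer treats
its pattern as the separator, splitting the text wherever the pattern matches,
so a character class listing the characters you want to keep ends up being used
as the delimiter. A rough Python sketch of that split behaviour (Python's re
has no \p{L}, so \w stands in for letters and digits):

    import re

    # Rough stand-in for the pattern tokenizer's default: the pattern is
    # the separator, and the text *between* matches becomes the tokens.
    text = "mail test@myco.com via http://myco.com:8080/path"

    # A class of the characters we want to KEEP, used as the split
    # pattern, consumes nearly the whole input as "separators":
    tokens = [t for t in re.split(r"[@/:.\w]+", text) if t.strip()]
    print(tokens)  # [] -- everything was treated as a delimiter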


Hi Joe,

As per your suggestion, I changed my tokenizer to the following, but it
doesn't help. I have also lost my initial nGram indexing. Am I missing
something?

ESINDEX = {
    "number_of_shards": 1,
    "analysis": {
        "filter": {
            "mynGram" : {
                "type" : "nGram",
                "min_gram": 1,
                "max_gram": 50
            }
        },
        "analyzer": {
            "a1" : {
                "type" : "pattern",
                "pattern" : "[@/:.\w\p{L}]+",
                "filter" : ["mynGram"]
            }
        }
    }
}

I did not understand the significance of the '*' in your pattern. Also, you
are matching 'letters' (\p{L}) and 'word characters' (\w) with any number of
occurrences of them (the trailing '+')?

Can you please clarify?

Thanks
-Praveen


Just in case I was not clear: I want my tokens to be anything made up of
letters, digits, @, -, ., :, and /.

Here is the pattern I am trying: "pattern" : "[@-.:/\p{L}\d]+"

On these tokens I need the nGram filter.

Any help is greatly appreciated.

Thanks in advance
-pk

ESINDEX = {
    "number_of_shards": 1,
    "analysis": {
        "filter": {
            "mynGram" : {
                "type" : "nGram",
                "min_gram": 1,
                "max_gram": 50
            }
        },
        "analyzer": {
            "a1" : {
                "type" : "pattern",
                "filter" : ["mynGram"],
                "pattern" : "[@-.:/\p{L}\d]+"
            }
        }
    }
}

ESMAPPINGS = {
    "index_analyzer" : "a1",
    "search_analyzer" : "whitespace",
    "date_formats" : ["yyyy-MM-dd", "MM-dd-yyyy"],
    "properties" : {
        u'my_field' : {
            'index' : 'not_analyzed',
            'type' : u'string',
            'store' : 'yes'
        },
        ::::::::::::::
    }
}
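
A side note on this attempt: inside a character class an unescaped '-' between
two characters forms a range, and since '@' (0x40) comes after '.' (0x2E) the
range "@-." is reversed, so Java's regex engine rejects the pattern outright.
A quick sketch with Python's re, which behaves the same way here:

    import re

    # "@-." is a reversed character range, so compilation fails;
    # java.util.regex.Pattern raises a similar error for this class.
    try:
        re.compile(r"[@-.:/\d]+")
    except re.error as err:
        print("bad pattern:", err)  # bad character range @-.

    # Putting '-' first (or last, or escaping it) makes it a literal:
    re.compile(r"[-@.:/\d]+")  # compiles fine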

Ah yes, the '*' characters aren't required.

You need to define the custom tokenizer for your analyzer, i.e.:

ESINDEX = {
    "number_of_shards": 1,
    "analysis": {
        "filter": {
            "mynGram" : {
                "type" : "nGram",
                "min_gram": 1,
                "max_gram": 50
            }
        },
        "analyzer": {
            "a1" : {
                "type" : "custom",
                "tokenizer" : "email_tokenizer",
                "filter" : ["mynGram"]
            }
        },
        "tokenizer" : {
            "email_tokenizer" : {
                "type" : "pattern",
                "pattern" : "[@/:.\w\p{L}]+"
            }
        }
    }
}
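
For what it's worth, the '*' characters were harmless rather than wrong:
inside a character class '*' is a literal asterisk, not a quantifier, and the
trailing '+' is what repeats the class. A small Python illustration (the
semantics are the same in Java regex):

    import re

    # Inside [...] the '*' is a literal character; the only difference it
    # makes is that '*' itself becomes a token character. The '+' after
    # the class is the real quantifier: one or more class characters.
    print(re.findall(r"[@*\w]+", "a*b c@d"))  # ['a*b', 'c@d']
    print(re.findall(r"[@\w]+",  "a*b c@d"))  # ['a', 'b', 'c@d']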

On Thursday, July 26, 2012 5:57:36 PM UTC-4, Praveen Kariyanahalli wrote:

Hi Joe

As per you suggestion, I changed my tokenizer to the following. But it
doesnt help. I have lost my initial ngram indexing too? Am I missing
something?

ESINDEX = {
"number_of_shards": 1,
"analysis": {
"filter": {
"mynGram" : {
"type" : "nGram",
"min_gram": 1,
"max_gram": 50
}
},
"analyzer": {
"a1" : {
"type" : "pattern",
"pattern" : "[@/:.\w\p{L}]+",
"filter" : ["mynGram"]
}
},

            }
        }

I did not understand the significance of the '*' in your pattern. Also you
are saying 'Letters' and 'word' and mention any number of occurrences of
that (the end +?).

Can you please clarify?

Thanks
-Praveen

On Thursday, July 26, 2012 1:53:57 PM UTC-7, Joe Wong wrote:

Do you mean you want to have ":", "/" searchable?
if so you may have to use a custom tokenizer and specify a pattern to
tokenize.

This will include all unicode characters plus the special characters such
as "/" and ":"

'tokenizer' : {

        'email_tokenizer' : {

            'type' : { 'pattern',

            'pattern' => "[@*\/*:*\.*\\w\\p{L}]+"

             }

        }

    }

On Thursday, July 26, 2012 2:28:46 PM UTC-4, Praveen Kariyanahalli wrote:

I saw many threads discuss it. I have crossed few hurdles, last one is
still bothering me. I had email address, ":", "/" in my data (which need to
be indexed and searched). Now I am able to search the email address
test@myco.com, but I still cannot have the following characters
indexed: ":" "/" and "-"
. Any help is greatly appreciated. Is it issue
with my index_analyzer or search_analyzer?

Thanks in Advance
-Praveen

Here is my mapping:

ESINDEX = {
"number_of_shards": 1,
"analysis": {
"filter": {
"mynGram" : {
"type" : "nGram",
"min_gram": 1,
"max_gram": 50
}
},
"analyzer": {
"a1" : {
"type" :"custom",
"tokenizer":"uax_url_email",
"filter" : ["mynGram"]
}
}
}
}

ESMAPPINGS = {
"index_analyzer" : "a1",
"search_analyzer" : "whitespace",
"properties" : {
u'test_field1' : {
'index' : 'not_analyzed',
'type' : u'string',
'store' : 'yes'
},
u'testfield2' : {
'index' : 'not_analyzed',
'type' : u'string',
'store' : 'yes'
},
u'email' : {
'index': 'not_analyzed',
'type' : u'string',
'store': 'yes'
},
:::::::::::
}

This finally worked (see below). I had to negate the set of characters that I
want in my tokens. It turns out the pattern defines the delimiter. So in my
case the pattern is saying: tokenize until you see a character other than
@, :, /, ., !, =, -, a letter, or a digit, then apply the filter to those
tokens. I reread the documentation and then I got
it: http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-analyzer.html

                "tokenizer" : {
                    "email_tokenizer" : {
                        "type" : "pattern",
                       * "pattern" : "[^@:\/\.\!\=\-\\w\\p{L}\\d]+"*
                    }
                },
                "analyzer": {
                    "a1" : {
                        "type" : "custom",
                        "tokenizer":"email_tokenizer",
                        "filter"   : ["mynGram"]
                    }
                }
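
A rough Python approximation of why negating the class works: splitting on
runs of everything outside the wanted set leaves the wanted runs intact as
tokens (again with \w standing in for \p{L}):

    import re

    # Split on runs of anything NOT in the wanted set; what survives the
    # split is exactly the runs of wanted characters, i.e. the tokens.
    text = "mail test@myco.com via http://myco.com:8080/a-b"
    tokens = [t for t in re.split(r"[^@:/.!=\-\w]+", text) if t]
    print(tokens)
    # ['mail', 'test@myco.com', 'via', 'http://myco.com:8080/a-b']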


I am using the pattern "[^@:/.!=-\w\p{L}\d]+" in Sense for the settings.
It shows a "Bad string syntax" error. Do I need to add anything to this
pattern in order for Sense to accept the string?
I would also like to add special characters like '-', '/', '(', ')' to my
pattern.

Here are my settings:

"analysis": {
    "analyzer": {
        "my_analyzer": {
            "type" : "custom",
            "tokenizer" : "special_tokenizer",
            "filter" : ["mynGram"]
        }
    },
    "tokenizer": {
        "special_tokenizer": {
            "type" : "pattern",
            "pattern" : "[^-/\w\p{L}\d]+"
        }
    },
    "filter": {
        "mynGram" : {
            "type" : "nGram",
            "min_gram": 1,
            "max_gram": 50
        }
    }
}

The "pattern" line is where I get the "Bad string syntax" error in Sense. Is
there any other way of giving the string?
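
The "Bad string" error is most likely JSON escaping rather than the regex
itself: \w is not a valid escape inside a JSON string, so every backslash in
the pattern has to be doubled before Sense will parse the body. A small Python
sketch of producing a correctly escaped pattern (note that '(' and ')' are
plain literals inside a character class, so they can simply be added to it):

    import json

    # "\w" is illegal inside a JSON string, which is what Sense flags as
    # bad string syntax; a literal backslash must be written as "\\".
    pattern = r"[^-/()\w\p{L}\d]+"  # '(' and ')' added inside the class
    print(json.dumps({"pattern": pattern}))
    # {"pattern": "[^-/()\\w\\p{L}\\d]+"}  <- how it must appear in Sense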

