Protect some words when tokenizing

Hi,

I use the standard tokenizer for some fields, and I would like to know if
there is a way to prevent strings such as "c++", "c#" or ".net" from being
tokenized as "c", "c" or "net", and to keep them unmodified instead.

Thanks in advance

Pierre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You could use a whitespace tokenizer instead to preserve punctuation on
this field...

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'


Thank you for the response,

but using 'whitespace' is not an option for me because I need comma, dot,
dash, etc. to be delimiters as well.

Pierre.


You should be able to use a custom-tuned word_delimiter filter to clean up
unwanted punctuation...

egaumer@ares:(src)$ curl -XPUT 'http://localhost:9200/test' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "my_delimiter" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : true,
          "catenate_numbers" : true,
          "generate_word_parts" : true,
          "protected_words" : ["C++", "C#"]
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/test/_analyze?tokenizer=whitespace&filters=my_delimiter&pretty=1' -d 'Hello, I write C++ code for wi-fi.'

Test that out and see if it does what you need. You can tweak other
settings on the word_delimiter filter to meet your needs.


Thank you, this fits my needs.


If you know the list of keywords to protect, you can also use a Keyword
Marker Token Filter.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/keyword-marker-tokenfilter.html
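For reference, a minimal keyword_marker setup might look like the sketch below (the filter name, keyword list, and analyzer chain are illustrative, not from this thread). The filter marks matching tokens with the keyword attribute so that downstream stemmers leave them alone:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "protect_terms": {
          "type": "keyword_marker",
          "keywords": ["c++", "c#", ".net"]
        }
      },
      "analyzer": {
        "protected": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "protect_terms", "porter_stem"]
        }
      }
    }
  }
}
```

This would be PUT as index settings, in the same way as the earlier curl examples.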


Ivan, unfortunately the keywordMarkerFilter only works in combination
with stemmers. I added the keyword attribute years ago to prevent some
stemmers from running the stemming algorithm on terms that are known to be
names, etc. I don't think this would help here.
In general, I would recommend using a simple tokenizer like whitespace and
then using a synonym filter to transform these kinds of tokens (c++ / c#) into
text representations (cPLUSPLUS / CSHARP); then you can go wild with
WordDelimiterFilter etc. once you have done this mapping.
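A rough sketch of that approach (the filter names and the exact synonym rules are my own illustration, not Simon's):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "code_synonyms": {
          "type": "synonym",
          "synonyms": [
            "c++ => cplusplus",
            "c# => csharp"
          ]
        },
        "code_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": true
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "code_synonyms", "code_delimiter"]
        }
      }
    }
  }
}
```

The ordering matters: the synonym filter rewrites "c++" into a plain-text token before the word_delimiter filter gets a chance to strip the punctuation.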

simon


My mistake! I read the word "protect" and thought of the keyword marker
filter. I once wrote a custom token filter on a Lucene project I was on,
not related to stemming, that used the keyword attribute. It is a useful
attribute, but it applies post-tokenization and is not what the OP is looking for.

Nowadays in Lucene I use a pattern tokenizer since the whitespace tokenizer
is too lenient, plus a word_delimiter filter (and stemmer overrides).
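As an illustration of that combination (the tokenizer name and the regex are assumptions on my part; the exact character class depends on which punctuation you want to act as a delimiter):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "strict_pattern": {
          "type": "pattern",
          "pattern": "[\\s,;:()]+"
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "strict_pattern",
          "filter": ["lowercase", "word_delimiter"]
        }
      }
    }
  }
}
```

Here the pattern tokenizer splits on whitespace and the listed punctuation, and a word_delimiter filter (which could carry a protected_words list, as in the earlier example) handles the rest.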

--
Ivan
