Protect some words when tokenizing

Hi,

I use the standard tokenizer for some fields, and I would like to know if
there is a way to prevent strings such as "c++", "c#" or ".net" from being
tokenized as "c", "c" or "net", and to keep them unmodified instead.

Thanks in advance

Pierre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You could use a whitespace tokenizer instead to preserve punctuation on
this field...

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'


Thank you for the response,

but using 'whitespace' is not an option for me because I need comma, dot,
dash, etc. to be delimiters as well.

Pierre.


You should be able to use a custom-tuned word_delimiter filter to clean up
unwanted punctuation...

egaumer@ares:(src)$ curl -XPUT 'http://localhost:9200/test' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "my_delimiter" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : true,
          "catenate_numbers" : true,
          "generate_word_parts" : true,
          "protected_words" : ["C++", "C#"]
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/test/_analyze?tokenizer=whitespace&filters=my_delimiter&pretty=1' -d 'Hello, I write C++ code for wi-fi.'

Test that out and see if it does what you need. You can tweak other
settings on the word_delimiter filter to meet your needs.


Thank you, this fits my needs.


If you know the list of keywords to protect, you can also use a Keyword
Marker Token Filter.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/keyword-marker-tokenfilter.html
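For reference, a minimal keyword_marker setup might look like the sketch below (the filter name, keyword list, and analyzer chain are illustrative, not from this thread). The filter marks matching tokens with the keyword attribute so that downstream stemmers leave them alone:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "protect_terms": {
          "type": "keyword_marker",
          "keywords": ["c++", "c#", ".net"]
        }
      },
      "analyzer": {
        "protected": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "protect_terms", "porter_stem"]
        }
      }
    }
  }
}
```

This would be PUT as index settings, in the same way as the earlier curl examples.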


Ivan, unfortunately the keywordMarkerFilter only works in combination
with stemmers. I added the keyword attribute years ago to prevent some
stemmers from running the stemming algorithm on terms that are known to be
names, etc. I don't think this would help here.
In general, I would recommend using a simple tokenizer like whitespace and
then using a synonym filter to transform these kinds of tokens (c++ / c#) into
text representations (cPLUSPLUS / CSHARP); then you can go wild with
WordDelimiterFilter etc. once you have done this mapping.
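A rough sketch of that approach (the filter names and the exact synonym rules are my own illustration, not Simon's):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "code_synonyms": {
          "type": "synonym",
          "synonyms": [
            "c++ => cplusplus",
            "c# => csharp"
          ]
        },
        "code_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": true
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "code_synonyms", "code_delimiter"]
        }
      }
    }
  }
}
```

The ordering matters: the synonym filter rewrites "c++" into a plain-text token before the word_delimiter filter gets a chance to strip the punctuation.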

simon


My mistake! I read the word "protect" and thought of the keyword marker
filter. I once wrote a custom token filter on a Lucene project I was on,
not related to stemming, that used the keyword attribute. It is a useful
attribute, but it applies post-tokenization and is not what the OP is looking for.

Nowadays in Lucene I use a pattern tokenizer since the whitespace tokenizer
is too lenient, plus a word_delimiter filter (and stemmer overrides).
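As an illustration of that combination (the tokenizer name and the regex are assumptions on my part; the exact character class depends on which punctuation you want to act as a delimiter):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "strict_pattern": {
          "type": "pattern",
          "pattern": "[\\s,;:()]+"
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "strict_pattern",
          "filter": ["lowercase", "word_delimiter"]
        }
      }
    }
  }
}
```

Here the pattern tokenizer splits on whitespace and the listed punctuation, and a word_delimiter filter (which could carry a protected_words list, as in the earlier example) handles the rest.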

--
Ivan
