I use the standard tokenizer for some fields, and I would like to know
whether there is a way to prevent strings such as "c++", "c#" or ".net"
from being tokenized as "c", "c" or "net", so that they are kept unmodified.
You could use a whitespace tokenizer instead to preserve punctuation on
this field...
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'
On Tuesday, February 12, 2013 8:19:57 AM UTC-5, Pierre de Soyres wrote:
Hi,
I use the standard tokenizer for some fields, and I would like to know
whether there is a way to prevent strings such as "c++", "c#" or ".net"
from being tokenized as "c", "c" or "net", so that they are kept unmodified.
Thank you for the response, but using 'whitespace' is not an option for me
because I need commas, dots, dashes, etc. to act as delimiters as well.
Pierre.
On Tuesday, February 12, 2013 14:47:23 UTC+1, egaumer wrote:
You could use a whitespace tokenizer instead to preserve punctuation on
this field...
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'
curl -XGET 'localhost:9200/test/_analyze?tokenizer=whitespace&filters=my_delimiter&pretty=1' -d 'Hello, I write C++ code for wi-fi.'
Test that out and see if it does what you need. You can tweak other
settings on the word_delimiter filter to meet your needs.
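The request above relies on a filter named `my_delimiter`, whose definition does not appear in the thread. A minimal sketch of what such a definition could look like, assuming it is a `word_delimiter` filter registered in the settings of the `test` index. The `type_table` entries, the settings file name and the choice to treat `+` and `#` as letters are assumptions for illustration, not the original configuration.

```shell
# Hypothetical definition of my_delimiter (the real one is not shown in the thread).
# type_table reclassifies '+' and '#' as letters so that word_delimiter
# does not treat them as delimiters.
cat > my_delimiter_settings.json <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_delimiter": {
          "type": "word_delimiter",
          "type_table": ["+ => ALPHA", "# => ALPHA"]
        }
      }
    }
  }
}
EOF

# Create the index with these settings (assumes a node on localhost:9200).
curl -XPUT 'localhost:9200/test' -d @my_delimiter_settings.json \
  || echo 'request failed: is Elasticsearch running on localhost:9200?'
```

With `+` and `#` classified as letters, `C++` and `c#` should pass through the filter as single tokens, while commas and dashes still act as delimiters. A leading dot, as in `.net`, would still be stripped unless `.` is remapped as well, which would in turn keep trailing periods attached to words. Recent Elasticsearch releases may also expect a `Content-Type: application/json` header and a JSON body for `_analyze`; the calls here follow the style used in this thread.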
On Tuesday, February 12, 2013 9:09:10 AM UTC-5, Pierre De Soyres wrote:
Thank you for the response, but using 'whitespace' is not an option for me
because I need commas, dots, dashes, etc. to act as delimiters as well.
Pierre.
Ivan, unfortunately the KeywordMarkerFilter only works in combination
with stemmers. I added the keyword attribute years ago to prevent some
stemmers from running the stemming algorithm on terms that are known to be
names, etc. I don't think this would help here.
In general I would recommend using a simple tokenizer like whitespace and
then using a synonym filter to transform these kinds of tokens (c++ / c#)
into text representations (cPLUSPLUS / CSHARP). Once you have done this
mapping, you can go wild with WordDelimiterFilter etc.
simon
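Simon's suggestion can be sketched as a custom analyzer: a whitespace tokenizer, a synonym filter that rewrites the special terms, then a word delimiter filter. The index name `test_synonyms`, the filter and analyzer names, and the lowercase replacement tokens are hypothetical. Lowercase replacements are used here because `word_delimiter` splits on case changes by default, which would break a token like `cPLUSPLUS` apart.

```shell
# Hypothetical analyzer: whitespace -> lowercase -> synonym -> word_delimiter.
# All names below (tech_terms, tech_text, test_synonyms) are illustrative.
cat > synonym_settings.json <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "tech_terms": {
          "type": "synonym",
          "synonyms": [
            "c++ => cplusplus",
            "c# => csharp",
            ".net => dotnet"
          ]
        }
      },
      "analyzer": {
        "tech_text": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "tech_terms", "word_delimiter"]
        }
      }
    }
  }
}
EOF

# Create the index, then run the analyzer on a sample sentence
# (assumes a node on localhost:9200).
curl -XPUT 'localhost:9200/test_synonyms' -d @synonym_settings.json \
  || echo 'request failed: is Elasticsearch running on localhost:9200?'
curl -XGET 'localhost:9200/test_synonyms/_analyze?analyzer=tech_text&pretty=1' -d 'I write C++ and .NET code' \
  || echo 'request failed: is Elasticsearch running on localhost:9200?'
```

One limitation of this sketch: the whitespace tokenizer leaves punctuation attached, so a token such as `C++,` (with a trailing comma) would not match the synonym rule and would be reduced by the word delimiter filter. The same mapping has to be applied at query time, which happens automatically if the field uses this analyzer for both indexing and searching.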
On Friday, February 15, 2013 5:19:23 PM UTC+1, Ivan Brusic wrote:
If you know the list of keywords to protect, you can also use a Keyword
Marker Token Filter.
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
My mistake! I read the word "protect" and thought of the keyword marker
filter. I once wrote a custom token filter on a Lucene project, not related
to stemming, that used the keyword attribute. It is a useful attribute, but
it applies after tokenization and is not what the OP is looking for.
Nowadays in Lucene I use a pattern tokenizer, since the whitespace tokenizer
is too lenient, plus a word_delimiter filter (and stemmer overrides).
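A pattern tokenizer along the lines Ivan describes could look like the sketch below. The regular expression, the index name `test_pattern` and the tokenizer and analyzer names are assumptions chosen for illustration. The pattern splits on whitespace, commas, dashes and similar punctuation, and on a dot only when it ends a word, so that `c++`, `c#` and `.net` survive tokenization.

```shell
# Hypothetical pattern tokenizer: the regex is an illustration, not Ivan's.
# It splits on runs of whitespace , ; : ! ? - and on a '.' that is
# followed by whitespace or the end of the input.
cat > pattern_settings.json <<'EOF'
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "tech_pattern": {
          "type": "pattern",
          "pattern": "[\\s,;:!?-]+|\\.(?=\\s|$)"
        }
      },
      "analyzer": {
        "tech_pattern_text": {
          "type": "custom",
          "tokenizer": "tech_pattern",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
EOF

# Create the index, then analyze the sample sentence from the thread
# (assumes a node on localhost:9200).
curl -XPUT 'localhost:9200/test_pattern' -d @pattern_settings.json \
  || echo 'request failed: is Elasticsearch running on localhost:9200?'
curl -XGET 'localhost:9200/test_pattern/_analyze?analyzer=tech_pattern_text&pretty=1' -d 'Hello, I write C++ code for wi-fi.' \
  || echo 'request failed: is Elasticsearch running on localhost:9200?'
```

This pattern does not treat a dot inside a word (as in `node.js`) as a delimiter, which may or may not match the requirement. A `word_delimiter` filter could be appended to the filter list, but with its default settings it would strip `+` and `#` again unless they are remapped through `type_table`.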