Trivial example of keyword_repeat?

Nikita_Tovstoles · March 12, 2014, 11:38pm

Could someone please share a trivial example of using the Keyword Repeat Token Filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html to:
“[Emit] each incoming token twice, once as keyword and once as a non-keyword,
to allow an un-stemmed version of a term to be indexed side by side with
the stemmed version of the term”

Maybe I don’t understand its intent, but is the idea to be able to tokenize
the string “one two” into “one”, “two”, “one two” (the last being unstemmed)?

If so, I tried the example config (from the docs) and the input was not preserved:

http://localhost:9200/_analyze?pretty=true&text=one%20two&filters=lowercase,keyword_repeat,porter_stem,unique

{
"tokens" : [ {
"token" : "one",
"start_offset" : 0,
"end_offset" : 3,
"type" : "",
"position" : 1
}, {
"token" : "two",
"start_offset" : 4,
"end_offset" : 7,
"type" : "",
"position" : 2
} ]
}
Shouldn't there be 3 tokens: “one”, “two”, “one two”?

…using v 1.0.1

BTW, it seems to make more sense to use the ‘keyword’ tokenizer instead of
‘standard’ (since the latter splits “one two” before the filter is even applied),
but that fails to return “one” and “two”:

http://localhost:9200/_analyze?pretty=true&text=one%20two&filters=lowercase,keyword_repeat,porter_stem,unique&tokenizer=keyword
{
"tokens" : [ {
"token" : "one two",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
} ]
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/93d58a7d-07c1-485f-afee-3c2f1e9b994f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

simonw_2 · March 13, 2014, 8:01am

The repeat filter only makes a visible difference for terms that actually get
stemmed. I.e., if you have "goes", it will be stemmed to "go", but with the
repeat filter it will also emit "goes" in addition to "go".

Makes sense?

simon
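
To make the behaviour above concrete, here is a minimal Python sketch that imitates the lowercase, keyword_repeat, stemmer, unique chain outside of Elasticsearch. Everything in it is an assumption for illustration: the Token class, the whitespace tokenizer, and especially toy_stem, which is a hypothetical suffix stripper and not the real Porter stemmer. The de-duplication step compares text and position, which is a simplification of what the real unique filter does.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False  # keyword-flagged tokens are skipped by the stemmer


def tokenize(text: str) -> List[Token]:
    # Stand-in for a real tokenizer: split on whitespace, positions start at 1.
    return [Token(word, i + 1) for i, word in enumerate(text.split())]


def lowercase(tokens: List[Token]) -> List[Token]:
    return [replace(t, text=t.text.lower()) for t in tokens]


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    # Emit every token twice at the same position:
    # once flagged as keyword, once as a normal token.
    out = []
    for t in tokens:
        out.append(replace(t, keyword=True))
        out.append(replace(t, keyword=False))
    return out


def toy_stem(word: str) -> str:
    # Hypothetical stemmer for illustration only (NOT Porter):
    # strips a trailing "es" or "s" from sufficiently long words.
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def stem(tokens: List[Token]) -> List[Token]:
    # Only non-keyword tokens are stemmed; keyword copies pass through unchanged.
    return [t if t.keyword else replace(t, text=toy_stem(t.text)) for t in tokens]


def unique(tokens: List[Token]) -> List[Token]:
    # Drop a token if the same text was already seen at the same position.
    seen = set()
    out = []
    for t in tokens:
        key = (t.text, t.position)
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


def analyze(text: str) -> List[Tuple[str, int]]:
    tokens = unique(stem(keyword_repeat(lowercase(tokenize(text)))))
    return [(t.text, t.position) for t in tokens]


if __name__ == "__main__":
    print(analyze("one two"))  # stemming changes nothing, so duplicates collapse
    print(analyze("goes"))     # stemmed and unstemmed forms share one position
```

In this sketch "one two" comes back as just "one" and "two", because the stemmed copy is identical to the keyword copy and the de-duplication step removes it. Only a term the stemmer actually changes, such as "goes" here, ends up with two tokens at the same position. A multi-word token like "one two" is never produced by this chain.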

