[In short: I cannot figure out whether it is possible to support precise
matches at field level without altering the type mapping. It seems like
keyword_repeat is the answer, but it is not clear which tokenizer to use with it.]
I have a User type with a unique 'name' property (and some other properties).
I would like to index my users in a way that allows me to run:
1. inexact search on the name field (i.e. 'find users that contain "foo" in
   name' or 'find users with a name similar to "bob"')
2. precise search on the name field ('find users with name === "bob smith"').
   This search should yield at most 1 result (since 'name' is unique).
With the default mapping #2 isn't addressed, since {"term": {"name": "bob"}}
will match users named "bob" but also "bob smith": the term "bob" is
indexed for both docs.
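
To illustrate why the term lookup matches both users, here is a toy inverted index in Python. It is only a sketch: `standard_like_tokens` and `build_index` are made-up helpers, and lowercasing plus whitespace splitting is a rough stand-in for what the default analyzer does, not Elasticsearch's actual implementation.

```python
from collections import defaultdict


def standard_like_tokens(text):
    """Rough stand-in for the default analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each token to the set of document ids whose name contains it."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in standard_like_tokens(name):
            index[token].add(doc_id)
    return index


docs = {1: "bob", 2: "bob smith"}
index = build_index(docs)

# A term query looks up a single term as-is, without analyzing it first.
print(sorted(index.get("bob", set())))        # [1, 2]
print(sorted(index.get("bob smith", set())))  # []
```

The whole string "bob smith" never exists as a term in the analyzed field, which is why an exact match needs a field that is indexed without analysis.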
I know I can get #2 by resorting to mapping user.name twice (and then
running term query on name.raw):
{
  "user": {
    "properties": {
      "name": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
I would like to avoid the above because the indexing process would become
more complicated: won't I lose the benefit of dynamic mapping and have to
hand-map all other properties of user?
With the standard tokenizer followed by keyword_repeat, _analyze still does
not return "bob jr" as one of the tokens for the text "bob jr". I am guessing
that is because the standard tokenizer splits "bob jr" into "bob" and "jr",
so keyword_repeat never sees "bob jr".
On the other hand, if I use the "keyword" tokenizer then "bob jr" is the sole
token returned (which probably means use case #1 won't be addressable).
It is also not clear what purpose keyword_repeat would serve in this case.
Would appreciate someone pointing me in the right direction.
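
The tokenizer behaviour described above can be mimicked with a small Python model. This is a sketch under loud assumptions: whitespace splitting stands in for the standard tokenizer, `toy_stemmer` is an invented stemmer that only strips a trailing "s", and the `(token, is_keyword)` tuples imitate the keyword flag that keyword_repeat sets on the first copy of each token. None of this is Elasticsearch or Lucene code.

```python
def standard_tokenizer(text):
    """Simplified: the real tokenizer uses word-boundary rules, not just spaces."""
    return text.split()


def keyword_tokenizer(text):
    """Emit the whole input as a single token."""
    return [text]


def keyword_repeat(tokens):
    """Emit every token twice: once flagged as keyword, once unflagged."""
    repeated = []
    for token in tokens:
        repeated.append((token, True))
        repeated.append((token, False))
    return repeated


def toy_stemmer(marked_tokens):
    """Hypothetical stemmer: strips a trailing 's' unless the token is a keyword."""
    stemmed = []
    for token, is_keyword in marked_tokens:
        if not is_keyword and token.endswith("s"):
            token = token[:-1]
        stemmed.append(token)
    return stemmed


def analyze(tokenizer, text):
    """Run tokenizer -> keyword_repeat -> toy_stemmer, like a tiny analysis chain."""
    return toy_stemmer(keyword_repeat(tokenizer(text)))


print(analyze(standard_tokenizer, "bob jr"))  # ['bob', 'bob', 'jr', 'jr']
print(analyze(keyword_tokenizer, "bob jr"))   # ['bob jr', 'bob jr']
print(analyze(standard_tokenizer, "cats"))    # ['cats', 'cat']
```

In this model the filter only ever duplicates what the tokenizer hands it, so it cannot produce "bob jr" after the text has been split. The last line shows where the duplication pays off: a stemmed and an unstemmed form of the same token.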
Appreciate that Clint. But I was asking whether I could do without having
to modify mappings - see ref to another post seemingly alluding to that
On Mar 12, 2014 1:51 PM, "Clinton Gormley" clint@traveljury.com wrote:
That post refers to using the keyword_repeat token filter to index stemmed
and unstemmed tokens in the same positions. It won't work for your use case
for exactly the reasons that you gave before: the standard tokenizer splits
"bob jr" into "bob" and "jr", so keyword_repeat never sees "bob jr" as a
single token, while the keyword tokenizer returns "bob jr" as the sole token,
which rules out inexact search (use case #1).
Why don't you like the idea of using multi-fields? It solves your problem
correctly and easily.
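
As a rough illustration of how the two searches would be split across the two fields, the request bodies could be built like this in Python. It assumes the name field has a not_analyzed sub-field reachable as name.raw, as discussed in this thread; the helper names `inexact_name_query` and `exact_name_query` are made up for the example, and nothing here talks to a real cluster.

```python
import json


def inexact_name_query(text):
    """Full-text search against the analyzed 'name' field (use case #1)."""
    return {"query": {"match": {"name": text}}}


def exact_name_query(full_name):
    """Exact lookup against the not_analyzed 'name.raw' sub-field (use case #2)."""
    return {"query": {"term": {"name.raw": full_name}}}


# Print the bodies that would be sent to the _search endpoint.
print(json.dumps(inexact_name_query("bob"), indent=2))
print(json.dumps(exact_name_query("bob smith"), indent=2))
```

Only the name property needs an explicit mapping for this; properties that are not mentioned in the mapping are still picked up by dynamic mapping.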
I originally thought that using multi-fields would require manual mapping
of the entire data model, and that keyword_repeat offered an alternative
not requiring mapping changes. After your comments and a peek at the
KeywordRepeatFilter source I see I was wrong on both. Thanks for your help!
(In reply to Clinton Gormley's message of Mar 13, 2014 2:17 AM.)
If you plan to do this frequently then go with the raw field. It'll be
faster.
If you want to fool around without changing any mappings then use a script
filter to get the field from the _source. It isn't efficient at all. I'd
suggest guarding it with a more efficient filter. Like so:
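
What follows is not the snippet from the original message, only a sketch of what a guarded script filter might look like. It assumes Elasticsearch 1.x-era request shapes (a filtered query with an `and` filter and a `script` filter taking `params`) and that `_source.name` is reachable from the script; the helper name `guarded_exact_name_filter` is invented, the script expression is untested, and lowercasing plus whitespace splitting is only an approximation of the default analyzer.

```python
import json


def guarded_exact_name_filter(full_name):
    """Build a search body: cheap term filters first, slow script check last."""
    # Assumption: the analyzed 'name' field holds lowercased, whitespace-split tokens.
    guard_terms = full_name.lower().split()
    guards = [{"term": {"name": term}} for term in guard_terms]

    # The script compares the stored source value to the requested name.
    script_check = {
        "script": {
            "script": "_source.name == full_name",
            "params": {"full_name": full_name},
        }
    }

    # Guards come first so the script is meant to run only on documents
    # that already contain every term of the name.
    return {"query": {"filtered": {"filter": {"and": guards + [script_check]}}}}


print(json.dumps(guarded_exact_name_filter("bob smith"), indent=2))
```

The term filters narrow the candidates to documents that already contain "bob" and "smith", so the expensive _source lookup is meant to run on only a handful of documents.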
Missed your last email. Ignore my suggestion and use the raw field:)
(In reply to Nikolas Everett's message of Thu, Mar 13, 2014 at 8:50 AM.)