Precise field matching without defining a not_analyzed extra field possible?


(Nikita Tovstoles) #1

[In short: I cannot figure out whether it is possible to support precise
matches at field level without altering the type mapping. It seems like
keyword_repeat is the answer, but it is not clear what tokenizer to use with it.]

I have a User type with unique 'name' property (and some other properties).
I would like to index my users in a way that allows me to run:

  1. inexact search on the name field (i.e. 'find users that contain "foo" in
    name' or 'find users with name similar to "bob"')
  2. precise search on the name field ('find users with name === "bob smith"').
    This search op should yield at most 1 result (since 'name' is unique).

With the defaults, #2 isn't addressed, since {term: {name: bob}} will match
users named "bob" but also "bob smith", because the term "bob" is
present for both docs.
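
The behaviour described above can be reproduced with a toy inverted index. This is a minimal sketch in plain Python, not Elasticsearch code: `analyze` is a hypothetical stand-in for the default analyzer (it only lowercases and splits on whitespace), and `build_index` and `term_query` are illustrative names.

```python
from collections import defaultdict

def analyze(text):
    # Hypothetical stand-in for the default analyzer: lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs, analyzer):
    # Map each term to the set of ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for term in analyzer(name):
            index[term].add(doc_id)
    return index

def term_query(index, term):
    # A term query looks the term up exactly as given, with no analysis.
    return sorted(index.get(term, set()))

docs = {1: "bob", 2: "bob smith"}

analyzed = build_index(docs, analyze)
print(term_query(analyzed, "bob"))        # both documents contain the term "bob"
print(term_query(analyzed, "bob smith"))  # no single term equals "bob smith"

# Indexing the whole value as one term (the idea behind not_analyzed)
# is what makes the exact lookup possible.
raw = build_index(docs, lambda text: [text])
print(term_query(raw, "bob smith"))
```

The last lookup returns only the document whose full name is "bob smith", which is the behaviour wanted for use case #2.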

I know I can get #2 by resorting to mapping user.name twice (and then
running term query on name.raw):
{
  "user": {
    "properties": {
      "name" : {
        "type": "string"
      },
      "name.raw": {
        "type" : "string",
        "index": "not_analyzed"
      }
    }
  }
}
I would like to avoid the above because indexing process would become more
complicated (won't I lose the benefit of dynamic mapping and have to
hand-map all other properties of user)?

If I am reading this post
(https://groups.google.com/forum/#!topic/elasticsearch/AUJQGy0A7gE)
correctly, I can address #2 (without losing #1) without manually mapping
extra columns by somehow using the Keyword Repeat Token Filter
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html).
Am I reading that correctly? But if I use these index settings:

"analysis" : {
  "analyzer": {
    "default": {
      "type" : "custom",
      "tokenizer": "standard",
      "filter" : ["lowercase", "keyword_repeat"]
    }
  }
}

then _analyze still does not return "bob jr" as one of the tokens for the text
"bob jr" - I am guessing because the standard tokenizer splits "bob jr" into
"bob" and "jr", and thus "keyword_repeat" never sees "bob jr".

On the other hand, if I use the "keyword" tokenizer, then "bob jr" is the sole
token returned (which probably means use case #1 won't be addressable).
It is also not clear what purpose keyword_repeat would serve in this case.
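
The two outcomes above can be simulated roughly as follows. The functions are hypothetical plain-Python stand-ins, not Elasticsearch APIs; the only behaviour assumed of keyword_repeat is that it emits each incoming token twice, once flagged as a keyword.

```python
def standard_tokenizer(text):
    # Simplified stand-in: split on whitespace (the real tokenizer follows Unicode word rules).
    return text.split()

def keyword_tokenizer(text):
    # Emits the whole input as a single token.
    return [text]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def keyword_repeat(tokens):
    # Emit every incoming token twice: once flagged as a keyword, once not.
    # It only duplicates what it receives; it cannot rejoin what the tokenizer split.
    out = []
    for t in tokens:
        out.append((t, True))
        out.append((t, False))
    return out

def analyze(text, tokenizer):
    return keyword_repeat(lowercase(tokenizer(text)))

std_terms = {t for t, _ in analyze("Bob Jr", standard_tokenizer)}
kw_terms = {t for t, _ in analyze("Bob Jr", keyword_tokenizer)}

print(sorted(std_terms))  # ['bob', 'jr']
print(sorted(kw_terms))   # ['bob jr']
```

In this model a single analysis chain yields either the per-word terms or the whole value, never both, which is the dilemma described above.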

Would appreciate someone pointing me in the right direction.

thanks
-nikita

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3b9e039b-16f4-40e9-8026-48b920202f70%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #2

You're almost there with:

On 12 March 2014 21:06, Nikita Tovstoles nikita.tovstoles@gmail.com wrote:

{
  "user": {
    "properties": {
      "name" : {
        "type": "string"
      },
      "name.raw": {
        "type" : "string",
        "index": "not_analyzed"
      }
    }
  }
}

Instead, use "multi-fields" (note: the syntax changed between 0.90.* and
1.0.*):

{
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

This gives you the analyzed `name` field and the not_analyzed `name.raw`
field.
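
A sketch of how the two fields might then be queried, written as plain Python dicts that serialize to request bodies. The mapping is the one above; the helper names and search strings are illustrative, and nothing here talks to a cluster. Only `name` is mapped explicitly, so other properties of `user` should still be handled by dynamic mapping.

```python
import json

# Mapping from the post above: `name` is analyzed, `name.raw` is indexed verbatim.
mapping = {
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "string",
                    "fields": {
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                }
            }
        }
    }
}

def inexact_name_query(text):
    # Use case 1: the match query analyzes `text` and searches the analyzed field.
    return {"query": {"match": {"name": text}}}

def exact_name_query(value):
    # Use case 2: the term query is not analyzed, so it must equal the indexed value.
    return {"query": {"term": {"name.raw": value}}}

print(json.dumps(mapping, indent=2))
print(json.dumps(inexact_name_query("bob")))
print(json.dumps(exact_name_query("bob smith")))
```

Because `name.raw` is not analyzed, the value passed to the term query has to match the stored name exactly, including case and whitespace.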

clint



(Nikita Tovstoles) #3

Appreciate that, Clint. But I was asking whether I could do without having
to modify mappings - see the reference to another post seemingly alluding to that.



(Clinton Gormley) #4

> Appreciate that Clint. But I was asking whether I could do without having
> to modify mappings - see ref to another post seemingly alluding to that

That post refers to using the keyword_repeat token filter to index stemmed
and unstemmed tokens in the same positions. It won't work for your use case
for exactly the reasons that you gave before:

> then _analyze still does not return "bob jr" as one of tokens for text "bob
> jr" - guessing because Standard tokenizer splits "bob jr" into "bob" and
> "jr" and thus "keyword_repeat" never sees "bob jr".
>
> On the other hand if I use "keyword" tokenizer than "bob jr" is the sole
> token returned (which probably means use case #1 won't be addressable).
> Also not clear what purpose would keyword_repeat serve in this case.

Why don't you like the idea of using multi-fields? It solves your problem
correctly and easily.
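
For context, the stemmed-plus-unstemmed use case mentioned above can be sketched like this. It is a toy model in plain Python under loud assumptions: `toy_stem` is a made-up suffix stripper rather than a real stemmer, tokens are modelled as (term, position, is_keyword) tuples, and the de-duplication step is only modelled on the idea of dropping identical tokens at the same position.

```python
def tokenize(text):
    # Simplified tokenizer: lowercase, split on whitespace, record positions.
    return [(term, pos) for pos, term in enumerate(text.lower().split())]

def keyword_repeat(tokens):
    # Emit each token twice at the same position; the first copy is flagged as a keyword.
    out = []
    for term, pos in tokens:
        out.append((term, pos, True))
        out.append((term, pos, False))
    return out

def toy_stem(term):
    # Made-up stemmer for illustration only: strip a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def stemmer(tokens):
    # Keyword-flagged tokens pass through untouched; the others are stemmed.
    return [(t if kw else toy_stem(t), pos, kw) for t, pos, kw in tokens]

def unique_same_position(tokens):
    # Drop duplicates at the same position (cases where stemming changed nothing).
    seen, out = set(), []
    for term, pos, _ in tokens:
        if (term, pos) not in seen:
            seen.add((term, pos))
            out.append((term, pos))
    return out

tokens = unique_same_position(stemmer(keyword_repeat(tokenize("Jumping dogs run"))))
print(tokens)
```

Each word ends up indexed in both its original and its stemmed form at the same position, but every token is still a single word, so the full value "bob jr" would never appear as a term.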



(Nikita Tovstoles) #5

I originally thought that using multi-fields would require manual mapping
of the entire data model, and thought that keyword_repeat offered an
alternative not requiring mapping changes. After your comments and a peek at
the KeywordRepeatFilter source, I see I was wrong on both. Thanks for your help!



(Nik Everett) #6

If you plan to do this frequently then go with the raw field. It'll be
faster.

If you want to fool around without changing any mappings then use a script
filter to get the field from the _source. It isn't efficient at all. I'd
suggest guarding it with a more efficient filter. Like so:

"query": {
  "filtered": {
    "filter": {
      "and": [
        {
          "term": {
            "title": "filter"
          }
        },
        {
          "script": {
            "script": "_source['title'] == 'Filter'"
          }
        }
      ]
    }
  }
}
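
Why the guard helps can be shown with a toy model. This is a plain-Python sketch, not how Elasticsearch is implemented: the inverted index and _source are modelled as dicts, the sample titles are invented, and `stats["source_reads"]` is a hypothetical counter standing in for the cost of loading each document's _source.

```python
from collections import defaultdict

# Invented sample documents, keyed by id.
sources = {
    1: {"title": "Filter"},
    2: {"title": "Filter basics"},
    3: {"title": "Query DSL"},
    4: {"title": "filter"},
}

# Toy inverted index over the analyzed title (lowercased, split on whitespace).
index = defaultdict(set)
for doc_id, src in sources.items():
    for term in src["title"].lower().split():
        index[term].add(doc_id)

stats = {"source_reads": 0}

def script_matches(doc_id, expected):
    # Stand-in for the script filter: load _source and compare the raw value.
    stats["source_reads"] += 1
    return sources[doc_id]["title"] == expected

def guarded_exact_match(term, expected):
    # The cheap term lookup runs first; the script only sees its survivors.
    candidates = sorted(index.get(term, set()))
    return [d for d in candidates if script_matches(d, expected)]

result = guarded_exact_match("filter", "Filter")
print(result)                 # ids whose title is exactly "Filter"
print(stats["source_reads"])  # documents the script had to inspect
```

Here the script inspects only the documents that contain the term "filter" rather than every document, which is the point of putting the term filter in front of it.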

Nik




(Nik Everett) #7

Missed your last email. Ignore my suggestion and use the raw field :)



