Precise field matching without defining a not_analyzed extra field possible?

Nikita_Tovstoles · March 12, 2014, 8:06pm

[in short: cannot figure out whether it is possible to support precise
matches at field level without altering type mapping. Seems like
keyword_repeat is the answer but not clear what tokenizer to use with it]

I have a User type with a unique 'name' property (and some other properties).
I would like to index my users in a way that allows me to run:

1. inexact search on the name field (i.e. "find users that contain 'foo' in
   name" or "find users with name similar to 'bob'")
2. precise search on the name field ("find users with name === 'bob smith'").
   This search op should yield at most 1 result (since 'name' is unique).

With the default mapping, #2 isn't addressed: {term: {name: bob}} will match
users named "bob" but also "bob smith", since the term "bob" is
indexed for both docs.
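
To make the problem concrete, here is a minimal Python sketch of a toy inverted index. It assumes the default analyzer behaves roughly like "lowercase and split on whitespace", which is a simplification, and `analyze`, `build_index` and `term_query` are hypothetical helpers, not Elasticsearch APIs.

```python
from collections import defaultdict

def analyze(text):
    # Rough stand-in for the default analyzer: lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Map each term to the set of document ids whose name contains it.
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in analyze(name):
            index[token].add(doc_id)
    return index

def term_query(index, term):
    # A term query looks the term up as-is, without analyzing it.
    return sorted(index.get(term, set()))

docs = {1: "bob", 2: "bob smith"}
index = build_index(docs)

print(term_query(index, "bob"))        # both documents contain the term "bob"
print(term_query(index, "bob smith"))  # no single term "bob smith" was indexed
```

In this toy model the whole name never exists as one term, so a term lookup cannot tell "bob" from "bob smith".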

I know I can get #2 by resorting to mapping user.name twice (and then
running term query on name.raw):
{
  "user": {
    "properties": {
      "name": {
        "type": "string"
      },
      "name.raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
I would like to avoid the above because the indexing process would become more
complicated (won't I lose the benefit of dynamic mapping and have to
hand-map all other properties of user?).

If I am reading this post (https://groups.google.com/forum/#!topic/elasticsearch/AUJQGy0A7gE) correctly, I can address #2 (without losing #1) without manually mapping
extra columns by somehow using the Keyword Repeat Token Filter (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keyword-repeat-tokenfilter.html).
Am I reading that correctly? But if I use these index settings:

"analysis": {
  "analyzer": {
    "default": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "keyword_repeat"]
    }
  }
}

then _analyze still does not return "bob jr" as one of the tokens for the text
"bob jr". I am guessing that is because the standard tokenizer splits "bob jr"
into "bob" and "jr", and thus "keyword_repeat" never sees "bob jr".

On the other hand, if I use the "keyword" tokenizer then "bob jr" is the sole
token returned (which probably means use case #1 won't be addressable).
It is also not clear what purpose keyword_repeat would serve in this case.
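
The two analysis chains described above can be approximated in a few lines of Python. This is only a sketch: the functions are hypothetical stand-ins (the real standard tokenizer follows Unicode word-boundary rules, and the real keyword_repeat filter flags one of the two copies as a keyword so that a later stemmer leaves it alone), but the order of operations is the point.

```python
def standard_tokenizer(text):
    # Simplified stand-in: split on whitespace.
    return text.split()

def keyword_tokenizer(text):
    # Emit the entire input as a single token.
    return [text]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def keyword_repeat(tokens):
    # Emit each incoming token twice. A token filter only ever sees tokens
    # the tokenizer has already produced; it cannot re-join them.
    out = []
    for t in tokens:
        out.extend([t, t])
    return out

def run(tokenizer, filters, text):
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

standard_tokens = run(standard_tokenizer, [lowercase, keyword_repeat], "Bob Jr")
keyword_tokens = run(keyword_tokenizer, [lowercase, keyword_repeat], "Bob Jr")

print(standard_tokens)  # "bob jr" never appears as one token
print(keyword_tokens)   # only the whole string, so partial matching is lost
```

Because the filter runs after the tokenizer, no filter setting can bring back the whole-string token once the standard tokenizer has split the input.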

Would appreciate someone pointing me in the right direction.

thanks
-nikita

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3b9e039b-16f4-40e9-8026-48b920202f70%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Clinton_Gormley · March 12, 2014, 8:51pm

You're almost there with:

{
  "user": {
    "properties": {
      "name": {
        "type": "string"
      },
      "name.raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Instead, use "multi-fields" (note: the syntax changed between 0.90.* and
1.0.*):

{
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

This gives you the analyzed `name` field and the not_analyzed `name.raw`
field.
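
With a mapping like that in place, the two searches from the original question could be expressed roughly as follows. This is a hedged sketch that only builds the request bodies as Python dicts; the search values are the examples used earlier in the thread, and sending the bodies to a cluster is left out.

```python
import json

# Inexact search: the query text is analyzed and matched against
# the analyzed `name` field.
inexact_search = {"query": {"match": {"name": "bob"}}}

# Precise search: the term is compared verbatim against the
# not_analyzed `name.raw` sub-field.
precise_search = {"query": {"term": {"name.raw": "bob smith"}}}

for body in (inexact_search, precise_search):
    print(json.dumps(body, indent=2))
```

Note that a not_analyzed field is matched verbatim, so the term query is case-sensitive: "Bob Smith" and "bob smith" are different terms.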

clint

Nikita_Tovstoles · March 12, 2014, 9:24pm

Appreciate that, Clint. But I was asking whether I could do without having
to modify mappings - see the reference to another post seemingly alluding to that.

Clinton_Gormley · March 13, 2014, 9:17am

> Appreciate that Clint. But I was asking whether I could do without having
> to modify mappings - see ref to another post seemingly alluding to that

That post refers to using the keyword_repeat token filter to index stemmed
and unstemmed tokens in the same positions. It won't work for your use case
for exactly the reasons that you gave before:

> then _analyze still does not return "bob jr" as one of tokens for text "bob
> jr" - guessing because Standard tokenizer splits "bob jr" into "bob" and
> "jr" and thus "keyword_repeat" never sees "bob jr".
>
> On the other hand if I use "keyword" tokenizer than "bob jr" is the sole
> token returned (which probably means use case #1 won't be addressable).
> Also not clear what purpose would keyword_repeat serve in this case.

Why don't you like the idea of using multi-fields? It solves your problem
correctly and easily.

Nikita_Tovstoles · March 13, 2014, 12:49pm

I originally thought that using multi-fields would require manual mapping
of the entire data model, and that keyword_repeat offered an
alternative not requiring mapping changes. After your comments and peeking at
the KeywordRepeatFilter source, I see I was wrong on both. Thanks for your help!

nik9000 · March 13, 2014, 12:50pm

If you plan to do this frequently then go with the raw field. It'll be
faster.

If you want to fool around without changing any mappings then use a script
filter to get the field from the _source. It isn't efficient at all. I'd
suggest guarding it with a more efficient filter. Like so:

"query": {
  "filtered": {
    "filter": {
      "and": [
        {
          "term": {
            "title": "filter"
          }
        },
        {
          "script": {
            "script": "_source['title'] == 'Filter'"
          }
        }
      ]
    }
  }
}
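
The guard idea can be illustrated outside Elasticsearch with a small Python sketch. It assumes a simplified analyzer (lowercase, split on whitespace); `guarded_exact_match` is a hypothetical helper, and the sample titles are made up for the illustration.

```python
def analyze(text):
    # Rough stand-in for the default analyzer: lowercase, split on whitespace.
    return text.lower().split()

def guarded_exact_match(docs, field, guard_term, exact_value):
    # Cheap guard: keep only documents whose analyzed field contains
    # guard_term (in Elasticsearch this is an index lookup, not a scan).
    candidates = [d for d in docs if guard_term in analyze(d[field])]
    # Expensive check: compare the original source value, which is what
    # the script filter does, but only for the surviving candidates.
    return [d for d in candidates if d[field] == exact_value]

docs = [
    {"title": "Filter"},
    {"title": "Filter basics"},
    {"title": "Queries"},
]

matches = guarded_exact_match(docs, "title", "filter", "Filter")
print(matches)
```

The term filter alone would keep both "Filter" and "Filter basics"; the exact comparison on the source value then narrows that down to the single exact match.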

Nik

nik9000 · March 13, 2014, 12:50pm

Missed your last email. Ignore my suggestion and use the raw field:)
