_all analyzer advice


#1

Hi all,

I have a Google-style search capability in my app that uses the _all field
with the default (standard) analyzer (I don't configure anything, so it's
Elasticsearch's default).

There are a few cases where we don't quite get the behaviour we want, and I
am trying to work out how I tweak the analyzer configuration.

  1. if the user searches using 99.97, then they get the results they expect,
    but if they search using 99.97%, they get nothing. They should get the
    results that match "99.97%". The default analyzer config loses the %, I
    guess.

  2. I have no idea what the text is :) but the user wants to search
    using 托克金通贸易 - which is in the data - but currently we get zero results. It
    looks like the standard analyzer/tokenizer breaks on each character.

I think I just want a whitespace analyzer with lower-casing ....
However,
a) I am not exactly sure how to configure that, and;
b) I am not 100% sure what I am losing/gaining vs the standard analyzer. (I
don't need stop-words; in any case the default config for the standard
analyzer doesn't have any, IIRC.)

(FWIW, on all our other text fields, we tend to use no analyzer)

(Elastic 1.1.1 and 1.2 ...)

Cheers.
-M

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6c6112e3-bdfb-4664-9fb6-b4b3c87f938f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Glen Smith) #2

You can set up an analyser for your index...

...
"my-index": {
    "analysis": {
        "analyzer": {
            "default_index": {
                "tokenizer": "standard",
                "filter": ["standard", "icu_fold_filter", "stop"]
            },
            "default_search": {
                "tokenizer": "standard",
                "filter": ["standard", "icu_fold_filter", "stop"]
            },
            "custom_index": {
                "tokenizer": "whitespace",
                "filter": ["lowercase"]
            },
            "custom_search": {
                "tokenizer": "whitespace",
                "filter": ["lowercase"]
            }
        }
    }
}
...

and then map your relevant field accordingly:

{
    "_timestamp": {
        "enabled": "true",
        "store": "yes"
    },
    "properties": {
        "my_field": {
            "type": "string",
            "index_analyzer": "custom_index",
            "search_analyzer": "custom_search"
        }
    }
}

Note that you can (and often should) set up index analysis and search
analysis differently (e.g. if you use synonyms, only expand search terms).

Hope I haven't missed the point...
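
To get a feel for what a whitespace tokenizer plus a lowercase filter does to the two problem inputs from the original question, here is a rough Python stand-in. It is only an approximation for illustration (whitespace_lowercase is a made-up helper, and Python's notion of whitespace and lower-casing is not guaranteed to match Lucene's exactly), not what Elasticsearch runs internally.

```python
def whitespace_lowercase(text):
    """Rough stand-in for a whitespace tokenizer followed by a lowercase filter."""
    # str.split() with no argument splits on runs of whitespace and drops empties.
    return [token.lower() for token in text.split()]


if __name__ == "__main__":
    # "99.97%" keeps its percent sign and the CJK string stays one token.
    print(whitespace_lowercase("Purity 99.97% from 托克金通贸易 Ltd"))
    # Punctuation stays attached too, so "99.97%," is a different token.
    print(whitespace_lowercase("Purity was 99.97%, as expected"))
```

The flip side is visible in the second call: because nothing but whitespace separates tokens, trailing commas, brackets and full stops stay glued to the word, so a search for 99.97% would not match text that contains "99.97%,".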



#3

Excellent. Thanks for the info.

Is it possible to set my custom analyser as the default analyser for an
index (i.e. instead of the standard analyzer)?

-N



(Glen Smith) #4

Totally. For example:

        "analyzer": {
            "default_index": {
                "tokenizer": "standard",
                "filter": ["standard", "lowercase"]
            },
            "default_search": {
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "stop"]
            }
        }
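
For the original question (whitespace plus lower-casing as the index-wide default), the same idea might look like the sketch below. It only builds the JSON body in Python so it can be checked before sending; default_analyzer_settings is a made-up helper name, and writing "type": "custom" explicitly is an assumption on my part rather than something taken from the snippets above.

```python
import json


def default_analyzer_settings():
    """Build a settings body that makes whitespace + lowercase the default."""
    # Assumption: the same analyzer is wanted at index time and search time.
    analyzer = {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["lowercase"],
    }
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # The special names discussed in this thread.
                    "default_index": dict(analyzer),
                    "default_search": dict(analyzer),
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(default_analyzer_settings(), indent=2))
```

The printed body would be supplied when the index is created; analyzers on an existing index generally cannot be swapped without reindexing the data.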



(Robbie) #5

Hi Glen,
On a related note, I have a use case where I want to search using
wildcards on a custom analyzed field. I am currently seeing some
discrepancies from what I expect.

Basically, I have string data in a field such as "Name-55", "Name-56" etc.
I want to be able to search for "Name-5*", and get these results.

I have indexed the data as terms
"Name", "-", "55"
"Name", "-", "56"

I am using a custom pattern analyzer to achieve this. I am using a similar
custom pattern analyzer for my query string, except that it also swallows
&, ? and *.

"my_template" : {
    "template" : "",
    "order": 1,
    "settings" : {
        "analysis": {
            "analyzer": {
                "custom_index": {
                    "type": "pattern",
                    "pattern": "([\\s]+)|((?<=\\p{L})(?=\\P{L})|((?<=\\P{L})(?=\\p{L}))|((?<=\\d)(?=\\D))|((?<=\\D)(?=\\d)))"
                },
                "custom_search": {
                    "type": "pattern",
                    "pattern": "([?&*\\s]+)|((?<=\\p{L})(?=\\P{L})|((?<=\\P{L})(?=\\p{L}))|((?<=\\d)(?=\\D))|((?<=\\D)(?=\\d)))"
                }
            }
        }
    },
    "mappings" : {
        "account" : {
            "properties" : {
                "myfield" : {
                    "type" : "string",
                    "store" : "yes",
                    "index" : "analyzed",
                    "index_analyzer" : "custom_index",
                    "search_analyzer": "custom_search"
                }
            }
        }
    }
}

Using this, I see that when I search for "Name-5*", I do not get any
results returned.

However, if I search for "Name- 5*" (Note additional white-space in the
search string), then I get the results Name-55 and Name-56.

Do you have an understanding of why Elasticsearch may be exhibiting this
behavior? Is there some issue in the way I have set up the patterns in my
analyzer?

Your help is much appreciated!

Thanks,
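
For anyone wanting to reproduce the index-time split described above outside Elasticsearch, here is a rough Python sketch of the custom_index pattern. Several things in it are assumptions: Python's re module has no \p{L}, so letters are approximated with [^\W\d_]; the lower-casing step reflects my understanding that the pattern analyzer lowercases by default; and pattern_analyze is a made-up helper. It needs Python 3.7+ because it splits on zero-width matches.

```python
import re

# Approximations of \p{L} and \P{L}, which Python's re module does not support.
LETTER = r"[^\W\d_]"
NON_LETTER = r"[\W\d_]"

# Same alternatives as the custom_index pattern, written without capture
# groups so that re.split() does not return the separators themselves.
INDEX_PATTERN = re.compile(
    r"\s+"
    rf"|(?<={LETTER})(?={NON_LETTER})"
    rf"|(?<={NON_LETTER})(?={LETTER})"
    r"|(?<=\d)(?=\D)"
    r"|(?<=\D)(?=\d)"
)


def pattern_analyze(text, lowercase=True):
    """Split text on the pattern and drop empty pieces."""
    tokens = [t for t in INDEX_PATTERN.split(text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens


if __name__ == "__main__":
    for value in ("Name-55", "Name-56"):
        print(value, "->", pattern_analyze(value))
```

Under these assumptions each value becomes three separate terms, and none of them starts with "name-5". If the wildcard term is compared against individual indexed terms (an assumption about the query type in use), that would be consistent with "Name-5*" finding nothing while "Name- 5*" does.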



#6

Thanks.
So default_index and default_search have special meaning.
Is this in the docs anywhere?

-N



(Glen Smith) #7

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-analyzers.html



#8

Ah. Cheers.
I had looked at that page a few times but missed that.


