Custom analyzer for match phrase


(nicktackes) #1

Hi,

I have attached a simple example of an index that defines a custom analyzer
called casesensitive. I want the default behavior for the index to be the
standard analyzer; however, I would like to be able to run a case-sensitive
query at certain times. Can I define an analyzer as shown below just for
use in a multi match query? In the example below I would expect no hits,
yet I still get the case-insensitive hit returned.

thanks

curl -X DELETE localhost:9200/mhw_test
curl -X PUT localhost:9200/mhw_test -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "casesensitive": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["standard", "stop"]
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {"type": "string"},
        "tags": {"type": "string"}
      }
    }
  }
}'

curl -X POST "http://localhost:9200/mhw_test/test" -d '
{
"title":"Metastatic Lung Cancer",
"tags":["MET","Lung","HGF"]
}'

curl -X POST "http://localhost:9200/mhw_test/test" -d '
{
"title":"Colorectal kras c.12345G>T",
"tags":["PTEN Loss","colorectal","KRAS"]
}'

curl -X POST "http://localhost:9200/mhw_test/test" -d '
{
"title":"Colorectal Kras",
"tags":["colorectal","KRAS c.12345G>T"]
}'

curl -XPOST 'http://localhost:9200/mhw_test/_refresh'

curl -X POST "http://localhost:9200/mhw_test/test/_search" -d '
{
  "filter": {
    "query": {
      "multi_match": {
        "fields": ["title", "tags"],
        "query": "met",
        "type": "phrase",
        "analyzer": "casesensitive"
      }
    }
  }
}'

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/74816325-1bc1-439b-abbc-032141ded717%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(nicktackes) #2

I am not seeing any difference for either a match query or a query string
query when specifying an analyzer. Can someone provide some input on the
usage of the analyzer at query time? Am I misunderstanding its purpose? I
have defined the analyzer at the index level for use on an ad hoc basis (I
would like the standard analyzer as the default behavior, hence nothing has
been explicitly specified in the field mapping). The attached script
demonstrates the test case I am working with.

thanks



(Njål Karevoll) #3

Hi Nick,

In the example you provide, the tags field will contain the term "met"
after being analyzed with the standard analyzer. Although you're
specifying a custom case-sensitive analyzer at query time, your search term
is "met" in lower case, which naturally matches the indexed term "met". If
you had searched for "MET" instead, you would have gotten zero results.
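
The reasoning above can be sketched with a toy model in plain shell. This is not Elasticsearch code: analyze_standard, analyze_cased and term_matches are hypothetical helpers that only approximate what the standard analyzer and the thread's casesensitive analyzer do to these short strings (split into words, with or without lowercasing).

```shell
#!/bin/bash
# Toy model of term matching; an illustration, not Elasticsearch itself.

# Roughly the standard analyzer: split on non-alphanumerics, then lowercase.
analyze_standard() {
    printf '%s\n' "$1" | tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]'
}

# Roughly the "casesensitive" analyzer: same split, but case is preserved
# because there is no lowercase filter in its filter chain.
analyze_cased() {
    printf '%s\n' "$1" | tr -cs '[:alnum:]' '\n'
}

# term_matches INDEXED_TERMS QUERY_TERM: exact, case-sensitive comparison
# of the query term against the list of indexed terms (one per line).
term_matches() {
    if printf '%s\n' "$1" | grep -qxF -- "$2"; then echo hit; else echo miss; fi
}

# Terms stored for the tags field, which is mapped to the standard analyzer.
indexed=$(analyze_standard "MET Lung HGF")

term_matches "$indexed" "$(analyze_cased met)"   # prints "hit": met equals met
term_matches "$indexed" "$(analyze_cased MET)"   # prints "miss": MET differs from met
```

The point of the sketch is that the query-time analyzer only changes the query text; the terms already written to the index stay lowercased.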

If you need to be able to sometimes search for terms case-sensitively,
you should prepare for this at index time: use the multi-field type to
create a new "virtual" field that is analyzed the way you want. I
prepared a small "play" that demonstrates this using three different search
requests here: https://www.found.no/play/gist/7870684, which you can run
against your own cluster to verify:

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "casesensitive": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "stop"
                    ]
                }
            }
        }
    },
    "mappings": {
        "test": {
            "properties": {
                "title": {
                    "type": "multi_field",
                    "fields": {
                        "title": {
                            "type": "string"
                        },
                        "title.cased": {
                            "type": "string",
                            "analyzer": "casesensitive"
                        }
                    }
                },
                "tags": {
                    "type": "multi_field",
                    "fields": {
                        "tags": {
                            "type": "string"
                        },
                        "tags.cased": {
                            "type": "string",
                            "analyzer": "casesensitive"
                        }
                    }
                }
            }
        }
    }
}'


# Index documents
# The type must be "test" so that the multi_field mapping defined above applies
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"test"}}
{"title":"Metastatic Lung Cancer","tags":["MET","Lung","HGF"]}
{"index":{"_index":"play","_type":"test"}}
{"title":"Colorectal kras c.12345G>T","tags":["PTEN Loss","colorectal","KRAS"]}
{"index":{"_index":"play","_type":"test"}}
{"title":"Colorectal Kras","tags":["colorectal","KRAS c.12345G>T"]}
'

# Do searches

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "filter": {
        "query": {
            "multi_match": {
                "fields": [
                    "title",
                    "tags"
                ],
                "query": "met",
                "type": "phrase",
                "analyzer": "casesensitive"
            }
        }
    }
}
'

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "filter": {
        "query": {
            "multi_match": {
                "fields": [
                    "title",
                    "tags"
                ],
                "query": "MET",
                "type": "phrase",
                "analyzer": "casesensitive"
            }
        }
    }
}
'

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "filter": {
        "query": {
            "multi_match": {
                "fields": [
                    "title.cased",
                    "tags.cased"
                ],
                "query": "met",
                "type": "phrase",
                "analyzer": "casesensitive"
            }
        }
    }
}
'

Sincerely,
Njal Karevoll



(nicktackes) #4

Thank you Njal for your reply. It is making a bit more sense.

On the search curl requests, if I remove the analyzer reference, they work
identically because of the mappings that have been defined. I am still a
bit confused as to when I can use the analyzer on the match query. Shown
below are two search cases...

  1. If I specify the casesensitive analyzer on the fields mapped to the
    standard analyzer, it does not work, meaning that it must still be using
    the indexed mapping....

    curl -XPOST "localhost:9200/mhw_test/test/_search?pretty" -d '
    {
    "filter": {
    "query": {
    "multi_match": {
    "fields": [
    "title",
    "tags"
    ],
    "query": "MET",
    "type": "phrase",
    "analyzer": "casesensitive"
    }
    }
    }
    }
    '

  2. If I specify the standard analyzer on the fields mapped to the
    casesensitive analyzer, it appears to use the standard analyzer, as no
    results are returned...

    curl -XPOST "localhost:9200/mhw_test/test/_search?pretty" -d '
    {
    "filter": {
    "query": {
    "multi_match": {
    "fields": [
    "title.cased",
    "tags.cased"
    ],
    "query": "MET",
    "type": "phrase",
    "analyzer": "standard"
    }
    }
    }
    }
    '
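
One way to reason about the two cases is to tabulate every combination of field analyzer (applied once, at index time) and query analyzer (applied to the query text only). The sketch below is a toy model in plain shell, not Elasticsearch code; analyze and check are hypothetical helpers, and each "analyzer" handles a single word, which is all the "MET" example needs.

```shell
#!/bin/bash
# Toy model only: "standard" lowercases the word, "casesensitive" keeps it as is.
analyze() {   # usage: analyze ANALYZER WORD
    case "$1" in
        standard)      printf '%s\n' "$2" | tr '[:upper:]' '[:lower:]' ;;
        casesensitive) printf '%s\n' "$2" ;;
    esac
}

# usage: check FIELD_ANALYZER QUERY_ANALYZER INDEXED_WORD QUERY_WORD
check() {
    local stored query
    stored=$(analyze "$1" "$3")   # term written to the index at index time
    query=$(analyze "$2" "$4")    # term produced from the query text
    if [ "$stored" = "$query" ]; then echo hit; else echo miss; fi
}

# Document contains "MET"; the query text is also "MET".
for field in standard casesensitive; do
    for q in standard casesensitive; do
        printf 'field=%-13s query_analyzer=%-13s -> %s\n' \
            "$field" "$q" "$(check "$field" "$q" MET MET)"
    done
done
```

Under this model only the rows where both sides use the same analyzer produce a hit. The two mixed rows correspond to cases 1 and 2 above: in both, the query analyzer is applied, but the term it produces no longer equals the term stored in the index.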


