Whitespace tokenizer not working as I'd expect

Hi all,

I'm trying to break up some strings to use in a full text search leaving
the original field intact. I have created a "full_text" field that is
populated from a "name" field using "copy_to" and an analyzer that looks
like this:

"settings" : {
    "analysis": {
        "char_filter" : {
            "full_text_mapping" : {
                "type": "mapping",
                "mappings" : [".=>%20", "_=>%20"]
            }
        },
        "analyzer" : {
            "full_text_analyzer" : {
                "type" : "custom",
                "char_filter" : "full_text_mapping",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase"]
            }
        }
    }
},

As you can see I'm trying to convert '.' and '_' to ' ' before the
whitespace tokenizer kicks in. It's my understanding that the char_filter
will replace those characters with whitespace that the whitespace tokenizer
would then tokenize and then all components could be searchable. For
instance, I would expect "GRIZZLY.BEAR" to be found using both "grizzly"
and "bear". But with the whitespace tokenizer I am not able to find the
document with either term. So what am I not understanding? Full script
showing what I'm doing:

#!/bin/sh

ES=localhost:9200

echo ">>> Deleting _all"
curl -XDELETE $ES/_all

echo ">>> Creating the index 'animals'"
curl -XPUT $ES/animals -d'
{
"settings" : {
"analysis": {
"char_filter" : {
"full_text_mapping" : {
"type": "mapping",
"mappings" : [".=>%20", "_=>%20"]
}
},
"analyzer" : {
"full_text_analyzer" : {
"type" : "custom",
"char_filter" : "full_text_mapping",
"tokenizer" : "whitespace",
"filter" : ["lowercase"]
}
}
}
},
"mappings" : {
"bear" : {
"properties" : {
"suggest" : {
"type" : "completion",
"analyzer" : "simple",
"payloads" : true
},
"full_text" : {
"type" : "string",
"analyzer" : "full_text_analyzer"
},
"name" : {
"type" : "string",
"index" : "not_analyzed",
"copy_to" : "full_text"
}
}
}
}
}' && echo

echo ">>> Indexing the GRIZZLY.BEAR document"
curl -XPOST $ES/animals/bear -d'
{
"name": "GRIZZLY.BEAR"
}
' && echo

curl -XPOST $ES/animals/_flush && echo

# Search for the document using the name

echo
echo ">>> Searching for name:GRIZZLY.BEAR"
echo
curl $ES/animals/bear/_search -d'
{
"query" : {
"match" : {
"name" : "GRIZZLY.BEAR"
}
}
}
' && echo

# Search for the document using a general term

echo
echo ">>> Searching for full_text:grizzly"
echo
curl $ES/animals/bear/_search -d'
{
"query" : {
"match" : {
"full_text" : "grizzly"
}
}
}
' && echo

# Search for the document using a general term

echo
echo ">>> Searching for full_text:bear"
echo
curl $ES/animals/bear/_search -d'
{
"query" : {
"match" : {
"full_text" : "bear"
}
}
}
' && echo

I appreciate any help with this!

Cheers,
Craig

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5fa2347f-3019-4973-9d67-7f18b3dfee9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

From which source did you assume that %20 is a white space?
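
For illustration, here is a rough local simulation of the analyzer chain from the original post. It is not Elasticsearch, just awk and tr standing in for the char filter, the whitespace tokenizer and the lowercase filter, and simulate_analyzer is a made-up helper name. It assumes the mapping char filter inserts the replacement text literally, with no URL decoding:

```shell
#!/bin/sh
# Simulation only: mapping char filter -> whitespace tokenizer -> lowercase.
# $1 is the input text, $2 is the text that '.' and '_' are replaced with.
simulate_analyzer() {
    text=$1
    replacement=$2
    printf '%s' "$text" |
        awk -v r="$replacement" '{ gsub(/[._]/, r); print }' |  # char filter
        tr -s ' ' '\n' |                                        # one token per line
        tr '[:upper:]' '[:lower:]'                              # lowercase filter
}

simulate_analyzer 'GRIZZLY.BEAR' '%20'   # a single token: grizzly%20bear
simulate_analyzer 'GRIZZLY.BEAR' ' '     # two tokens: grizzly and bear
```

With "%20" the whitespace tokenizer never sees a space, so the whole string stays one token and neither "grizzly" nor "bear" can match it.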

The mapping char filter understands \uXXXX notation (which is not
documented in ES).

With curl, on bash, you have to escape the \u notation with double
backslash like this

". => \\u0020"
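
Applied to the char_filter from the original post, that would look roughly like the sketch below. CHAR_FILTER is only an illustrative variable name; the point is that inside single quotes the shell keeps both backslashes, so the JSON parser ends up handing the six characters \u0020 to the mapping char filter:

```shell
#!/bin/sh
# Single quotes keep both backslashes intact, so the request body really
# contains \\u0020 (which JSON decodes to the literal text \u0020).
CHAR_FILTER='{
    "full_text_mapping" : {
        "type" : "mapping",
        "mappings" : [".=>\\u0020", "_=>\\u0020"]
    }
}'

# printf with %s prints the string untouched; some echo implementations
# would interpret the backslashes.
printf '%s\n' "$CHAR_FILTER"
```

This fragment would take the place of the "char_filter" value in the index settings; the rest of the settings stay as they were.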

Here is a working example:

https://gist.github.com/jprante/d7c839d47e9a9dc78311

Jörg

On Thu, Mar 12, 2015 at 3:41 PM, Craig Ching craigching@gmail.com wrote:

[...]

On Friday, March 13, 2015 at 4:47:31 AM UTC-5, Jörg Prante wrote:

> From which source did you assume that %20 is a white space?

It was just a guess since, as you say, it's not documented ;) After
switching to %20 the field did appear to tokenize differently, but I
couldn't figure out how to verify that it had worked, so I just assumed
it did.

> The mapping char filter understands \uXXXX notation (which is not
> documented in ES).
>
> With curl, on bash, you have to escape the \u notation with double
> backslash like this
>
> ". => \\u0020"
>
> Here is a working example
>
> https://gist.github.com/jprante/d7c839d47e9a9dc78311

Awesome, thanks very much for that, works like a charm!

> Jörg
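
For anyone else who wants to check what an analyzer actually emits, the _analyze API should help. A minimal sketch, assuming the ES 1.x style used in this thread (analyzer passed as a query parameter, text as the request body; newer releases expect a JSON body instead). analyze_url and show_tokens are hypothetical helper names:

```shell
#!/bin/sh
ES=${ES:-localhost:9200}

# Build the _analyze URL for an index ($1) and one of its analyzers ($2).
analyze_url() {
    printf '%s/%s/_analyze?analyzer=%s&pretty' "$ES" "$1" "$2"
}

# Ask Elasticsearch which tokens the analyzer produces for the text in $3.
# Needs a running node and an existing index; nothing is called here.
show_tokens() {
    curl -s -XGET "$(analyze_url "$1" "$2")" -d "$3"
}
```

Running show_tokens animals full_text_analyzer 'GRIZZLY.BEAR' against the index from the script should then show whether the name comes back as one token or two.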
