Whitespace tokenizer not working as I'd expect

cching · March 12, 2015, 2:41pm

Hi all,

I'm trying to break up some strings to use in a full text search leaving
the original field intact. I have created a "full_text" field that is
populated from a "name" field using "copy_to" and an analyzer that looks
like this:

"settings" : {
    "analysis": {
        "char_filter" : {
            "full_text_mapping" : {
                "type": "mapping",
                "mappings" : [".=>%20", "_=>%20"]
            }
        },
        "analyzer" : {
            "full_text_analyzer" : {
                "type" : "custom",
                "char_filter" : "full_text_mapping",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase"]
            }
        }
    }
},

As you can see, I'm trying to convert '.' and '_' to ' ' before the
whitespace tokenizer kicks in. My understanding is that the char_filter
replaces those characters with whitespace, the whitespace tokenizer then
splits on that whitespace, and every component becomes searchable. For
instance, I would expect "GRIZZLY.BEAR" to be found using both "grizzly"
and "bear". But with the whitespace tokenizer I am not able to find the
document with either term. So what am I not understanding? Full script
showing what I'm doing:

#!/bin/sh

ES=localhost:9200

echo ">>> Deleting _all"
curl -XDELETE $ES/_all

echo ">>> Creating the index 'animals'"
curl -XPUT $ES/animals -d'
{
    "settings" : {
        "analysis": {
            "char_filter" : {
                "full_text_mapping" : {
                    "type": "mapping",
                    "mappings" : [".=>%20", "_=>%20"]
                }
            },
            "analyzer" : {
                "full_text_analyzer" : {
                    "type" : "custom",
                    "char_filter" : "full_text_mapping",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase"]
                }
            }
        }
    },
    "mappings" : {
        "bear" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion",
                    "analyzer" : "simple",
                    "payloads" : true
                },
                "full_text" : {
                    "type" : "string",
                    "analyzer" : "full_text_analyzer"
                },
                "name" : {
                    "type" : "string",
                    "index" : "not_analyzed",
                    "copy_to" : "full_text"
                }
            }
        }
    }
}' && echo

echo ">>> Indexing the GRIZZLY.BEAR document"
curl -XPOST $ES/animals/bear -d'
{
    "name": "GRIZZLY.BEAR"
}
' && echo

curl -XPOST $ES/animals/_flush && echo

# Search for the document using the name

echo
echo ">>> Searching for name:GRIZZLY.BEAR"
echo
curl $ES/animals/bear/_search -d'
{
    "query" : {
        "match" : {
            "name" : "GRIZZLY.BEAR"
        }
    }
}
' && echo

# Search for the document using a general term

echo
echo ">>> Searching for full_text:grizzly"
echo
curl $ES/animals/bear/_search -d'
{
    "query" : {
        "match" : {
            "full_text" : "grizzly"
        }
    }
}
' && echo

# Search for the document using a general term

echo
echo ">>> Searching for full_text:bear"
echo
curl $ES/animals/bear/_search -d'
{
    "query" : {
        "match" : {
            "full_text" : "bear"
        }
    }
}
' && echo

I appreciate any help with this!

Cheers,
Craig

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5fa2347f-3019-4973-9d67-7f18b3dfee9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · March 13, 2015, 9:47am

From which source did you assume that %20 is a white space?

The mapping char filter understands \uXXXX notation (which is not
documented in ES).
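
To make the difference concrete, here is a rough stand-in for the analyzer chain built from sed and tr. It is not Elasticsearch, only a sketch of the same three stages (char filter, whitespace tokenizer, lowercase filter), and analyze_sketch is a made-up helper name. It suggests why a "%20" replacement leaves a single token while a real space yields two:

```shell
#!/bin/sh
# Stand-in for: mapping char_filter -> whitespace tokenizer -> lowercase.
# $1 is the input text, $2 is what '.' and '_' get replaced with.
# Prints one token per line. Hypothetical helper, not an ES command.
analyze_sketch() {
    text=$1
    replacement=$2
    printf '%s\n' "$text" |
        sed "s/[._]/$replacement/g" |   # char filter: rewrite '.' and '_'
        tr -s ' \t' '\n\n' |            # tokenizer: split on whitespace
        tr '[:upper:]' '[:lower:]'      # token filter: lowercase
}

echo "replacement '%20':"
analyze_sketch 'GRIZZLY.BEAR' '%20'     # one token: grizzly%20bear

echo "replacement ' ':"
analyze_sketch 'GRIZZLY.BEAR' ' '       # two tokens: grizzly, bear
```

With "%20" the text never contains whitespace, so there is nothing for the tokenizer to split on, and a search for "grizzly" or "bear" has no matching term.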

With curl, on bash, you have to escape the \u notation with a double
backslash, like this:

". => \\u0020"

Here is a working example

https://gist.github.com/jprante/d7c839d47e9a9dc78311 (char-filter.sh)

curl -XDELETE 'localhost:9200/test'

curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis": {
            "char_filter" : {
                "full_text_mapping" : {
                    "type": "mapping",

(Preview truncated; the full script is in the gist linked above.)
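
On the double-backslash point: the reply says the \u notation has to be escaped when the body goes through curl on bash. The sketch below only prints what the shell would hand to curl under the two quoting styles; the variable names are illustrative. With single quotes (as in the scripts in this thread) both backslashes survive, so the JSON parser sees an escaped backslash and the char filter receives the text \u0020. With double quotes the shell collapses the pair first.

```shell
#!/bin/sh
# Single quotes: the shell leaves backslashes untouched, so the JSON
# body carries two backslashes before u0020.
single='". => \\u0020"'

# Double quotes: the shell itself turns \\ into \, so only one
# backslash is sent and a JSON parser would decode \u0020 to a space.
double="\". => \\u0020\""

printf 'single-quoted body: %s\n' "$single"
printf 'double-quoted body: %s\n' "$double"
```

So if the -d argument were wrapped in double quotes instead, four backslashes would be needed to get the same bytes on the wire.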

Jörg

cching · March 16, 2015, 1:41pm

On Friday, March 13, 2015 at 4:47:31 AM UTC-5, Jörg Prante wrote:

From which source did you assume that %20 is a white space?

It was just a guess since, as you say, it's not documented. After using
%20, it did appear to tokenize differently, though I couldn't figure out
how to prove that it had worked, and I guess I just assumed it did.
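
One way to check what an analyzer really emits is the _analyze API, which returns the token list for a piece of text. The sketch below assumes the 1.x-style form of that endpoint (analyzer name as a query parameter, raw text as the request body) and the 'animals' index from the original script; newer releases expect a JSON body instead. The helper names analyze_url and show_tokens are made up for illustration.

```shell
#!/bin/sh
# Sketch: ask the node which tokens an analyzer produces for some text.
ES=${ES:-localhost:9200}

# Build the _analyze URL for an index ($1) and analyzer name ($2).
analyze_url() {
    printf '%s/%s/_analyze?analyzer=%s&pretty' "$ES" "$1" "$2"
}

# Print the token list for the text in $3 (needs a running node).
show_tokens() {
    curl -s -XGET "$(analyze_url "$1" "$2")" -d "$3"
}

# Only talk to the node if it answers, so the sketch is harmless otherwise.
if curl -s -o /dev/null --max-time 2 "$ES"; then
    show_tokens animals full_text_analyzer 'GRIZZLY.BEAR'
else
    echo "no node at $ES, skipping"
fi
```

If the char filter is doing its job, the response should list two separate tokens rather than one token that still contains the replacement text.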

Awesome, thanks very much for that, works like a charm!
