Changing tokenizer from whitespace to standard

Andy_Bajka_2 · April 16, 2013, 4:41pm

It looks like the "whitespace" tokenizer doesn't work very well for my forum
searches, as we often search for alphanumeric words. For example, when I search for:

test12345678

I get back thousands of results when I should get back only one.

I assume that if I change the "whitespace" to "standard" this will correct
the problem. Here is a portion of my analyzer code.

"settings" : {
    "index" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 0
    }, 
    "analysis" : {
        "filter" : {
            "tweet_filter" : {
                "type" : "word_delimiter",
                "type_table": ["( => ALPHA", ") => ALPHA"]
            } 
        },
        "analyzer" : {
            "tweet_analyzer" : {
                "type" : "custom",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase", "tweet_filter"]
            }
        }
    }
},

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Andy_Bajka_2 · April 16, 2013, 5:27pm

I changed it from whitespace to standard and re-indexed. Unfortunately, that
didn't help.

I'm going to go back to whitespace and, for now, only allow alpha characters
to be searched, with the exception of parentheses.

Hopefully someone with expertise will have a better solution.


spinscale · April 17, 2013, 6:39am

Hey,

Can you show two sample documents (one that is returned correctly and one
that is not), as well as your query, so we can debug your problem?

Also, you should check out the analyze API, which lets you see how
strings are tokenized:

curl 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'test12345678'
curl 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'test12345678'

Both outputs do not differ, so it is clear that your change did not have
any effect. See more at

Also, you might want to install the excellent inquisitor plugin, so you have
a nice web GUI for analyzing stuff, see

--Alex
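
One possible explanation, not confirmed anywhere in this thread: the custom analyzer in the first post also applies a word_delimiter filter, and with its default settings that filter splits tokens at letter/digit boundaries. If that is what is happening, "test12345678" is indexed and searched as "test" plus "12345678" no matter which tokenizer runs first, which would account for the thousands of hits. The Python sketch below is a simplified model of that behavior, not Lucene's implementation; the function names are hypothetical, and it ignores the type_table entries for parentheses.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only, as a whitespace tokenizer would."""
    return text.split()


def split_letters_digits(token):
    """Simplified stand-in for word_delimiter's default numeric splitting:
    every maximal run of letters or of digits becomes its own sub-token."""
    return re.findall(r"[^\W\d_]+|\d+", token)


def analyze(text):
    """Model of: tokenizer -> lowercase -> word_delimiter (defaults only)."""
    tokens = []
    for tok in whitespace_tokenize(text):
        tokens.extend(split_letters_digits(tok.lower()))
    return tokens


print(analyze("test12345678"))  # ['test', '12345678']
```

Running the analyze API against the index with analyzer=tweet_analyzer, rather than analyzer=standard or analyzer=whitespace, would confirm or refute this on a real cluster.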


vallabh · November 22, 2013, 12:24pm

Hi Everyone,

I am using the analysis-phonetic plugin for searching, which follows Lucene.

I have artist names stored as ke$ha, !!! (chk chk chk), Jay-Z and so on.

I escape the exclamation marks because an unescaped exclamation mark breaks the query, use synonyms to match ke$ha and !!! (chk chk chk), and set "tokenizer" : "whitespace".
In this case, when I search for kesha (without $), I get the expected result, ke$ha,
and when I search for !!! (3 exclamation marks), I get !!! (chk chk chk).

The problem is Jay-Z:
I want to search for it as jay z (without the hyphen and with a space in between),
but this only works when I set "tokenizer" : "standard".

And when I set "tokenizer" : "standard", the kesha and exclamation mark searches stop working.

I want to use both tokenizers together.
I think this is possible with a custom tokenizer, but I have not been able to build one because I am new to Elasticsearch.

I have created 2 files:

process.sh - where I do the indexing

echo 'Delete the index.'
curl -X DELETE 'http://localhost:9200/admin/?pretty=true'

echo; echo
echo 'Create the index.'

curl -X PUT 'http://localhost:9200/admin/?pretty=true' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "artist_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["standard", "lowercase", "synonym", "artist_metaphone", "asciifolding"]
                }
            },
            "filter" : {
                "artist_metaphone" : {
                    "type" : "phonetic",
                    "encoder" : "metaphone",
                    "replace" : false
                },
                "synonym" : {
                    "type" : "synonym",
                    "synonyms_path" : "/var/www/html/elasticsearch-master/synonyms.txt"
                }
            }
        }
    }
}
'

echo; echo
echo 'Create the mapping.'
curl -X PUT 'http://localhost:9200/admin/jos_artist_details/_mapping?pretty=true' -d '
{
    "jos_artist_details" : {
        "properties" : {
            "name" : {
                "type": "string",
                "index_analyzer": "artist_analyzer",
                "search_analyzer": "artist_analyzer"
            }
        }
    }
}
'

artist_display.php - where I search and display the data

$es = Client::connection(array(
    'servers' => '127.0.0.1:9200',
    'protocol' => 'http',
    'index' => 'admin',
    'type' => 'jos_artist_details'
));

$result = $es->search(array(
    "query" => array(
        "dis_max" => array(
            "queries" => array(
                0 => array(
                    "field" => array(
                        "name" => $search
                    )
                )
            )
        )
    ),
    "from" => 0,
    "size" => 100000
));

$total = $result['hits']['total'];
$data = $result['hits']['hits'];

Any help is very much appreciated.
Thanks,
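
For the artist-name question above, one approach that might work (an assumption, not something tested in this thread) is to keep the whitespace tokenizer and add a word_delimiter filter with a type_table, in the same style as the first post, that treats "$" and "!" as ALPHA. Hyphens and parentheses would then still split tokens, while ke$ha and !!! survive intact. The Python sketch below only models that splitting rule; the names are hypothetical, it is not Lucene's implementation, and it leaves out the synonym and phonetic filters as well as word_delimiter's case and letter/digit splitting.

```python
# Hypothetical type_table equivalent: "$ => ALPHA", "! => ALPHA"
ALPHA_OVERRIDES = {"$", "!"}


def is_word_char(ch):
    """Letters and digits are word characters, plus the overridden symbols."""
    return ch.isalnum() or ch in ALPHA_OVERRIDES


def delimiter_split(token):
    """Split a token at every character that is not a word character."""
    parts, current = [], ""
    for ch in token:
        if is_word_char(ch):
            current += ch
        elif current:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def analyze(text):
    """Model of: whitespace tokenizer -> lowercase -> delimiter split."""
    tokens = []
    for tok in text.split():
        tokens.extend(delimiter_split(tok.lower()))
    return tokens


for name in ["ke$ha", "!!! (chk chk chk)", "Jay-Z"]:
    print(name, "->", analyze(name))
```

In this model "Jay-Z" and the query "jay z" both come out as the tokens jay and z, while "ke$ha" and "!!!" each stay a single token, so the existing synonym entries would still have something to match. Where the synonym filter sits relative to the new filter would need checking against the analyze API.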
