Changing tokenizer from whitespace to standard


(Andy Bajka-2) #1

It looks like the "whitespace" tokenizer doesn't work very well for my forum
searches, as we often search for alphanumeric words. For example, when I search for:

test12345678

I get back thousands of results when I should get back only one.

I assume that if I change the "whitespace" to "standard" this will correct
the problem. Here is a portion of my analyzer code.

"settings" : {
    "index" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 0
    }, 
    "analysis" : {
        "filter" : {
            "tweet_filter" : {
                "type" : "word_delimiter",
                "type_table": ["( => ALPHA", ") => ALPHA"]
            } 
        },
        "analyzer" : {
            "tweet_analyzer" : {
                "type" : "custom",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase", "tweet_filter"]
            }
        }
    }
},

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Andy Bajka-2) #2

I changed it from whitespace to standard and re-indexed. Unfortunately, that
didn't help.

I'm going to go back to whitespace and for now only allow alpha characters
to be searched, with the exception of parentheses.

Hopefully someone with expertise will have a better solution.



(Alexander Reelsen) #3

Hey,

Can you show two sample documents (one that is returned correctly and one
that is not), as well as your query, so we can debug your problem?

Also, you should check out the analyze API, which lets you see how
strings are tokenized:

curl 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'test12345678'
curl 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'test12345678'

The two outputs do not differ, so it is clear why your change did not have
any effect. See more at
http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze/
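
One possible explanation that the two _analyze calls above do not cover: tweet_analyzer also applies a word_delimiter filter, which by default splits tokens where letters and digits meet. If so, test12345678 would probably be indexed as test and 12345678 whichever tokenizer is used. The Python sketch below only simulates that behaviour to illustrate the idea. It is not Lucene's implementation, the function names are made up for illustration, and it approximates the type_table by simply treating parentheses as letters.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace, roughly like the whitespace tokenizer."""
    return text.split()

def split_on_numerics(token):
    """Simplified stand-in for word_delimiter: split where letters and digits meet.

    Assumption: parentheses count as letters (mimicking the type_table above);
    every other character is treated as a delimiter and dropped.
    """
    return re.findall(r"[A-Za-z()]+|[0-9]+", token)

def analyze(text):
    """Lowercase, tokenize on whitespace, then apply the simplified filter."""
    tokens = []
    for token in whitespace_tokenize(text.lower()):
        tokens.extend(split_on_numerics(token))
    return tokens

print(analyze("test12345678"))  # ['test', '12345678']
```

If this is what happens, a search for test12345678 could match every document containing test or 12345678, which would be consistent with thousands of hits. Running the analyze API against the index itself with analyzer=tweet_analyzer (rather than the built-in standard or whitespace analyzers) should confirm or rule this out.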

Also, you might want to install the excellent inquisitor plugin, so you have
a nice web GUI for analyzing stuff, see

--Alex

On Tue, Apr 16, 2013 at 7:27 PM, Andy Bajka andybajka2012@gmail.com wrote:

I changed it from whitespace to standard and re-indexed. Unfortunately,
that didn't help.

I'm going to go back to whitespace and for now only allow alpha characters
to be searched, with the exception of parentheses.

Hopefully someone with expertise will have a better solution.



(vallabh) #4

Hi Everyone,

I am using the analysis-phonetic plugin for searching, which follows Lucene.

I have artist names stored as ke$ha, !!! (chk chk chk), Jay-Z and so on.

Because an exclamation mark breaks the query, I escape those special characters. I use synonyms to match ke$ha and !!! (chk chk chk), and I set "tokenizer" : "whitespace".
In this case, when I search for kesha (without the $), I get the expected result, ke$ha,
and when I search for !!! (three exclamation marks), I get !!! (chk chk chk).

The problem is this:
For Jay-Z, I want to be able to search for jay z (without the hyphen and with a space in between).
That only works when I set "tokenizer" : "standard".

But when I set "tokenizer" : "standard", the kesha and exclamation mark searches no longer work.

I want the behaviour of both tokenizers together.
I think this is possible with a custom tokenizer, but I have been unable to develop one because I am new to Elasticsearch.

I have created 2 files:

  1. process.sh - where I do the indexing

echo 'Delete the index.'
curl -X DELETE 'http://localhost:9200/admin/?pretty=true'

echo; echo
echo 'Create the index.'

curl -X PUT 'http://localhost:9200/admin/?pretty=true' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "artist_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["standard", "lowercase", "synonym", "artist_metaphone", "asciifolding"]
                }
            },
            "filter" : {
                "artist_metaphone" : {
                    "type" : "phonetic",
                    "encoder" : "metaphone",
                    "replace" : false
                },
                "synonym" : {
                    "type" : "synonym",
                    "synonyms_path" : "/var/www/html/elasticsearch-master/synonyms.txt"
                }
            }
        }
    }
}
'

echo; echo
echo 'Create the mapping.'
curl -X PUT 'http://localhost:9200/admin/jos_artist_details/_mapping?pretty=true' -d '
{
    "jos_artist_details" : {
        "properties" : {
            "name" : {
                "type": "string",
                "index_analyzer": "artist_analyzer",
                "search_analyzer": "artist_analyzer"
            }
        }
    }
}
'

  2. artist_display.php - where I search and display the data

$es = Client::connection(array(
    'servers' => '127.0.0.1:9200',
    'protocol' => 'http',
    'index' => 'admin',
    'type' => 'jos_artist_details'
));

$result = $es->search(array(
    "query" => array(
        "dis_max" => array(
            "queries" => array(
                0 => array(
                    "field" => array(
                        "name" => $search
                    )
                )
            )
        )
    ),
    "from" => 0,
    "size" => 100000
));

$total = $result['hits']['total'];
$data = $result['hits']['hits'];

Any help is very much appreciated.
Thanks,
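
One approach that might give both behaviours is to keep the whitespace tokenizer and turn hyphens into spaces before tokenizing, for instance with a char filter (Elasticsearch has a mapping char filter type, but the exact configuration would need checking against your version). The Python sketch below only simulates that idea. The function names are hypothetical, and the synonym and phonetic filters from artist_analyzer are deliberately left out.

```python
def char_filter(text, mappings):
    """Replace each source string with its target before tokenizing."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def analyze_artist(text):
    """Map '-' to a space, split on whitespace, then lowercase each token."""
    filtered = char_filter(text, {"-": " "})
    return [token.lower() for token in filtered.split()]

print(analyze_artist("Jay-Z"))   # ['jay', 'z']
print(analyze_artist("ke$ha"))   # ['ke$ha']
print(analyze_artist("!!!"))     # ['!!!']
```

In this simulation Jay-Z produces the same tokens as a search for jay z, while ke$ha and !!! stay intact for the synonym filter to handle, which is the combination described above.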

