Indexing and searching on special characters?


(tsandstr) #1

I am having some trouble with how indexing works on strings. I use the default mapping for strings and query_string for my queries.

I index the following:

curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{
"message": "A&B: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/2 -d '{
"message": "A+B: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/3 -d '{
"message": "A-B: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/4 -d '{
"message": "AB: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/5 -d '{
"message": "A&C: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/6 -d '{
"message": "A+C: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/7 -d '{
"message": "A-C: This is just a test"
}'
curl -XPUT http://localhost:9200/twitter/tweet/8 -d '{
"message": "A*C: This is just a test"
}'

and I get the following results with my queries:

{'query': {'query_string': {'query': 'A+B', 'default_operator': 'AND', 'default_field': 'message'}}}
{u'message': u'A&B: This is just a test'}
{u'message': u'A+B: This is just a test'}
{u'message': u'A-B: This is just a test'}
{u'message': u'A*B: This is just a test'}
--> I expected to get only: A+B: This is just a test

{'query': {'query_string': {'query': 'A&B', 'default_operator': 'AND', 'default_field': 'message'}}}
{u'message': u'A&B: This is just a test'}
{u'message': u'A+B: This is just a test'}
{u'message': u'A-B: This is just a test'}
{u'message': u'A*B: This is just a test'}
--> I expected to get only: A&B: This is just a test

{'query': {'query_string': {'query': 'A*B', 'default_operator': 'AND', 'default_field': 'message'}}}
--> Got no results at all; this is VERY confusing

{'query': {'query_string': {'query': 'A-B', 'default_operator': 'AND', 'default_field': 'message'}}}
{u'message': u'A&B: This is just a test'}
{u'message': u'A+B: This is just a test'}
{u'message': u'A-B: This is just a test'}
{u'message': u'A*B: This is just a test'}
--> I expected to get only: A-B: This is just a test

{'query': {'query_string': {'query': 'A B', 'default_operator': 'AND', 'default_field': 'message'}}}
{u'message': u'A&B: This is just a test'}
{u'message': u'A+B: This is just a test'}
{u'message': u'A-B: This is just a test'}
{u'message': u'A*B: This is just a test'}

Why does the query A*B not return anything at all?
Is there a simple way to know how special characters are indexed by default?
It seems to me that A&B, A+B and A-B are all indexed the same way as A B. Am I correct?

What about all other characters? How do they behave and how should I construct my query in order to get correct results? Escaping them does not seem to help.
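One way to answer this for any string is to ask the cluster directly with the _analyze API. A minimal sketch (the URL-parameter form shown here matches 2012-era Elasticsearch; later versions take a JSON request body instead):

```shell
# Show the tokens the standard analyzer produces for each string.
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'A&B: This is just a test'
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'A+B: This is just a test'
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'A*B: This is just a test'
```

If all three calls return the same token list, that confirms the punctuation is discarded at index time, which is why these documents are indistinguishable at search time.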

Any help would be appreciated.

Thank you!

  • Tommy


(tsandstr) #2

Never mind; it seems that I should escape all special characters. And since the standard analyzer either replaces the special character with spaces or does something else with it, I will end up with several results no matter which special character is used.

This is okay for me, since I do not really care about the extra characters and do not need a search on them.

In some cases though it could be good to index the extra characters too. But I guess the only way to do that is to create a custom analyzer and use it?
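A custom analyzer along those lines might look like the following sketch, assuming splitting only on whitespace is acceptable (the index name symbols_test and analyzer name keep_symbols are made up for illustration):

```shell
# Create an index whose custom analyzer keeps punctuation by splitting
# only on whitespace. Note that with this tokenizer "A&B:" is indexed
# as the single token "a&b:", trailing colon included.
curl -XPUT 'http://localhost:9200/symbols_test' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "keep_symbols": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'
```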

One more question. How are endings such as 's handled? For example: Twitter's

I would rather remove the 's endings during indexing instead of keeping them. Can this behaviour be easily achieved?

  • Tommy

(Chris Male) #3

Tommy,

On Friday, October 12, 2012 12:31:43 AM UTC+13, tommy wrote:

> Nevermind, it seems that I should escape all extra characters. And since the standard analyzer either replaces the special character with spaces or does something else, I will end up with several results no matter which special character is used.
>
> This is okay for me, since I do not really care about the extra characters and do not need a search on them.
>
> In some cases though it could be good to do the indexing of the extra characters too. But I guess the only way to do that is to create an own analyzer and use it?

StandardAnalyzer is really just built from a Tokenizer and a number of TokenFilters. To get different behaviour (such as indexing the extra characters) you can define a new Analyzer in your ES mapping which uses a different Tokenizer or TokenFilters.
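To actually take effect, such an analyzer also needs to be referenced from the field mapping. A minimal sketch (my_analyzer stands in for whatever analyzer name you defined in the index settings):

```shell
# Tell the mapping to analyze the message field with the custom analyzer.
curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
  "tweet": {
    "properties": {
      "message": { "type": "string", "analyzer": "my_analyzer" }
    }
  }
}'
```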

> One more question. How are endings such as 's handled? For example: Twitter's
>
> I would rather remove the 's endings, instead of having them, during index. Can this behaviour be easily achieved?
>
>   • Tommy
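On the possessive endings: the stemmer token filter has a possessive_english variant that appears to do exactly this. A minimal sketch (the index, filter, and analyzer names are made up for illustration):

```shell
# Strip trailing 's during analysis, so "Twitter's" indexes as "twitter".
curl -XPUT 'http://localhost:9200/possessive_test' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "strip_possessive": { "type": "stemmer", "name": "possessive_english" }
      },
      "analyzer": {
        "no_possessive": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "strip_possessive"]
        }
      }
    }
  }
}'
```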

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Indexing-and-searching-on-special-characters-tp4023761p4023781.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

