How to do exact match with elasticsearch Chinese analysis plugin?


(hau) #1

I was using the elasticsearch-analysis-smartcn plugin to perform full text
search in Chinese.

I had an ActiveRecord model, Article, which has a string content and many
tags. Tags could be written in Chinese, and I want to do exact matching on
the tags. I created this mapping inside Article:

    class Article < ActiveRecord::Base
      include Tire::Model::Search
      include Tire::Model::Callbacks
      settings analysis: {
        filter: {
          smartcn_word: {
            'type' => 'smartcn_word'
          }
        },
        analyzer: {
          smart_chinese: {
            'tokenizer' => 'smartcn_sentence',
            'filter' => ['smartcn_word'],
            'type' => 'smart_chinese'
          }
        }
      } do
        mapping do
          indexes :content, type: 'string', analyzer: 'smart_chinese'
          indexes :tags, type: 'string', index: 'not_analyzed'
        end
      end
    end

where Article's tags is an array of strings.

I had one Article with tags: ["irresistible"] and another with tags:
["恭喜发财"]. I wanted to search by exact matching on the tags. When I ran

    Article.search("resistible")

it correctly returned zero matches. But when I searched with

    Article.search("发财")

it returned one match. So it's doing exact matching for English but a
substring match for Chinese. How can I make it do exact matching for
Chinese as well?

Thanks for your help.


(hau) #2

I was using the tire gem https://github.com/karmi/tire in a Rails
application.


(Shay Banon) #3

I think you end up searching against the _all field (the one that
aggregates all the other fields). Try and specify explicitly that you want
to search on the not analyzed tags.
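
As a minimal sketch of what "searching explicitly on the not analyzed tags" could look like at the query DSL level, the Ruby below builds a term query against the tags field and prints the JSON body. It uses only Ruby's standard json library. The helper name exact_tag_query is hypothetical, and how the body gets sent to elasticsearch (Tire, curl, or anything else) is deliberately left out.

```ruby
require 'json'

# Hypothetical helper (not part of Tire): builds a query body that targets
# the tags field directly instead of the default _all field.
def exact_tag_query(tag)
  { query: { term: { tags: tag } } }
end

# Print the JSON body that would be sent to the _search endpoint.
puts JSON.generate(exact_tag_query('发财'))
```

Because a term query is not analyzed, and the tags field is mapped as not_analyzed, this should only match an Article that has a tag equal to the whole string '发财', not one tagged '恭喜发财'.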



(hau) #4

OK, thanks. But could you point out how that could be done, i.e. how to
avoid searching on _all?

I want the search to match strings in the content field and to match
individual tags exactly. So it's considered a match if a keyword appears
inside Article.content OR if a keyword exactly matches one of the tags.

The tire call I used was Article.search(keyword). Is that not correct? Or
did I not define the mapping correctly?



(hau) #5

I changed the mapping to:

    class Article < ActiveRecord::Base
      include Tire::Model::Search
      include Tire::Model::Callbacks
      settings analysis: {
        filter: {
          smartcn_word: {
            'type' => 'smartcn_word'
          }
        },
        analyzer: {
          smart_chinese: {
            'tokenizer' => 'smartcn_sentence',
            'filter' => ['smartcn_word'],
            'type' => 'smart_chinese'
          }
        }
      } do
        mapping do
          indexes :content, type: 'string', analyzer: 'smart_chinese'
          indexes :tags, type: 'string', index: 'not_analyzed', include_in_all: false
        end
      end
    end

But when I did Article.search('发财'), it still matched the Article tagged
with '恭喜发财'.



(Matt) #6

I'm not sure exactly what Article.search does: does it just call the
endpoint, or can you use it to build a query, and in particular pick the
type of query (http://www.elasticsearch.org/guide/reference/query-dsl/)?

I've found that it's easier to first stick to the straight ES HTTP API, so
you can figure out exactly how the query DSL works (it's non-trivial, and
can be a little confusing), before trying to use one of the client
libraries. I've been working through this same situation, but with the
Python pyes library, also in a multilingual context, and the client can
mask away some of the details of the API.
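
As a rough sketch of what talking to the HTTP API directly could look like from Ruby, the snippet below builds (but does not send) a search request using only the standard library. The index name articles, the host and port localhost:9200, and the helper name build_search_request are assumptions for illustration, not details confirmed in this thread.

```ruby
require 'json'
require 'net/http'

# Hypothetical helper: wraps a query DSL hash in a POST request to the
# _search endpoint of the given index. The index name is an assumption;
# substitute whatever index the Article model actually writes to.
def build_search_request(index, query)
  req = Net::HTTP::Post.new("/#{index}/_search",
                            'Content-Type' => 'application/json')
  req.body = JSON.generate(query)
  req
end

term_query = { query: { term: { tags: '恭喜发财' } } }
req = build_search_request('articles', term_query)

# Show exactly what would go over the wire, with no client library in between.
puts "#{req.method} #{req.path}"
puts req.body

# To actually send it (assumes a node listening on localhost:9200):
# Net::HTTP.start('localhost', 9200) { |http| puts http.request(req).body }
```

Seeing the raw request body makes it easier to tell which query type and which field a search is really using.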

To be honest, I'm surprised that "恭喜发财" is parsed as a single term; I
thought it would be split into "恭喜" and "发财", but I don't really know
Chinese (haha). Going back to the first point, are you doing a term query
(http://www.elasticsearch.org/guide/reference/query-dsl/term-query.html)?


(hau) #7

Despite reading your tips, I don't think I'm making progress. My original
example was not complete, so let me elaborate.

I want to search a class of documents that are JSON representations of
Article objects. An Article could look like

    {
      title: "article subject line",
      tags: ["chinese", "food"],
      content: "article content"
    }

where I want to do:

  1. substring match on title
  2. exact match on tags
  3. fulltext search on content

The text is mostly Chinese, but could be a mix of Chinese and English.
When I defined the mapping, I did

    settings analysis: {
      filter: {
        smartcn_word: {
          'type' => 'smartcn_word'
        }
      },
      analyzer: {
        smart_chinese: {
          'tokenizer' => 'smartcn_sentence',
          'filter' => ['smartcn_word'],
          'type' => 'smart_chinese'
        }
      }
    } do
      mapping do
        indexes :title, type: 'string', analyzer: 'keyword'
        indexes :content, type: 'string', analyzer: 'smart_chinese'
        indexes :tags, type: 'string', analyzer: 'keyword'
      end
    end

which means I used the Chinese analyzer for content and keyword analyzer
for title and tags.

When I performed the search, the query JSON looked like this (I got this
from the logs):

    '{"query":{"bool":{"should":[{"term":{"title":"search words"}},{"query_string":{"query":"search words","default_field":"content","default_operator":"AND"}},{"terms":{"tag_list":["search", "words"]}}],"minimum_number_should_match":1}}}'

Fulltext search on the content worked fine, but I had problems getting the
substring match on title and the exact match on tags to work: they returned
zero results. I only got results when I queried against _all, i.e. when the
query JSON was:

    '{"query":{"query_string":{"query":"search words","default_operator":"AND"}}}'

I guess my questions are:

  1. Am I correct in using a "term" query and the keyword analyzer for
    substring matching on the title field?
  2. Am I correct in using a "terms" query and the keyword analyzer for
    exact matching on tags?
  3. Am I correct in using bool with three "should" queries in it?

I am seeing other issues when I mix Chinese and English, but I think that
should go in another thread.

Please pardon me if my questions look dumb. I am new to elasticsearch (and
lucene), but I did read through its Guide and Tutorials. I'd appreciate it
if someone with more experience could help me figure this out.



(Matt) #8

I'm still learning ES myself, so I'm not sure I can answer all of those
questions, but here goes:

  1. For the "term" search type, the field itself should be set to
    "not_analyzed"; this should be set when you define the mapping. I think
    this also applies to "terms". There's an example
    on http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index.html
    which should point the way.
  2. The bool looks fine, if ES doesn't throw a QueryParse exception (or
    whatever it's called)
  3. This is a wild guess, but you might not be using the same analyzer for
    indexing as for searching? Try setting the "analyzer" field
    (http://www.elasticsearch.org/guide/reference/api/search/uri-request.html)

See if some or all of those help.
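
To make the points above concrete, here is a hedged sketch of a bool query for the three requirements from the earlier post. Two details of the logged query seem worth checking. First, its terms clause targets tag_list while the mapping shown defines tags. Second, it splits the keyword into "search" and "words", whereas a keyword-analyzed or not_analyzed field stores each tag as one whole token. Similarly, a term query on a keyword-analyzed title can only match the entire title, so the sketch swaps in a wildcard query as one possible way to get substring matching; that is a substitution, not something proposed in this thread. The helper name article_query is hypothetical.

```ruby
require 'json'

# Hypothetical helper: builds a bool query with three "should" clauses.
# Assumes the field names title, content and tags from the mapping above.
def article_query(keyword)
  {
    query: {
      bool: {
        should: [
          # Substring match on title. Wildcard characters inside the keyword
          # itself are not escaped here, and leading wildcards can be slow.
          { wildcard: { title: "*#{keyword}*" } },
          # Analyzed full-text search restricted to the content field.
          { query_string: { query: keyword,
                            default_field: 'content',
                            default_operator: 'AND' } },
          # Exact match: the whole keyword must equal one whole tag.
          { term: { tags: keyword } }
        ],
        minimum_number_should_match: 1
      }
    }
  }
end

puts JSON.pretty_generate(article_query('search words'))
```

With this shape, a document matches if any one of the three clauses matches, which is the "content OR exact tag" behaviour asked for earlier in the thread.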

