Issue with using Thai Language Analyzer


(Mishari) #1

Hi,

I'm having an issue with the Thai language analyzer. I have the
following mapping defined:

{ u'parsedtext': { 'index': 'analyzed', 'store': 'yes', 'type':
u'string', 'index_analyzer': u'thai', 'search_analyzer': u'thai' } }

and I've submitted the following sentence for indexing: {"parsedtext":
u"ฉันนั่งตากลม"}. For the benefit of those in the forum, I will
romanize it as "channangtaklom" (Thai has no spaces between
words).

Now, I can query for the string "tak", but I can't search for "taklom".
What am I missing?
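
A minimal sketch of how word segmentation can produce this symptom, written as a toy in Python. It is not Lucene's ThaiAnalyzer: the greedy longest-match rule, the `DICTIONARY` contents and the romanized strings are all assumptions made for illustration. The point is only that a search term has to line up with the tokens the analyzer emitted at index time.

```python
# Toy greedy longest-match segmenter. NOT Lucene's ThaiAnalyzer.
# The dictionary and the romanized strings are hypothetical.
DICTIONARY = {"chan", "nang", "ta", "tak", "klom", "lom"}


def segment(text, dictionary=DICTIONARY):
    """Split text into the longest dictionary words, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


indexed_terms = segment("channangtaklom")
print(indexed_terms)
print("tak" in indexed_terms)     # is "tak" one of the indexed terms?
print("taklom" in indexed_terms)  # is "taklom" one of the indexed terms?
```

With this hypothetical dictionary the sentence is cut as chan / nang / tak / lom, so "tak" exists as an indexed term while "taklom" does not. Whether the real analyzer cuts the Thai text the same way is something to verify against its actual token output.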


(ppearcy) #2

Hey,
This is a pretty lame answer, but I believe all this functionality
is inherited directly from Lucene. You may be able to get more
detailed answers on that discussion group. This ticket seems to be
the one that introduced this support, and it may have some answers:
https://issues.apache.org/jira/browse/LUCENE-503

Again, this isn't the answer I'd prefer to give, but if neither of these
yields much info, I'd recommend digging into the code.

Hope this is at least marginally helpful :)

Best Regards,
Paul


(Shay Banon) #3

I had a chat with Mishari on IRC, and something is strange: the
behavior of the stock Lucene ThaiAnalyzer was not what he expected when used
with elasticsearch (though really, elasticsearch just delegates to it).
Mishari, any updates?
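
One way to check what the analyzer actually emits is to ask elasticsearch's _analyze endpoint for the tokens. The sketch below only builds the request and does not send it. The host `http://localhost:9200` and the index name `test` are placeholders, and the JSON body with `analyzer` and `text` fields is the form documented for current releases; older releases may expect a different request shape, so check the documentation for the version in use.

```python
import json
from urllib.request import Request


def build_analyze_request(text, analyzer="thai", index="test",
                          host="http://localhost:9200"):
    """Build (but do not send) a request to the index's _analyze endpoint.

    host and index are placeholders; adjust them for your cluster.
    """
    url = "{}/{}/_analyze".format(host, index)
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


req = build_analyze_request("ฉันนั่งตากลม")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` against a running node should return the token list as JSON, which can then be compared with the terms being searched for.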


(Mishari) #4

Hi,

It took a while before I could figure out what's going on. It seems
that when the tokenizer comes across a transliterated word, it either
prepends a word to it or appends one, so I suppose I should start
digging into Lucene. The question is: if I fix the bug, how can
I get the patch into elasticsearch for use as soon as possible?


(Shay Banon) #5

If you manage to fix it, you can either build your own analyzers from
Lucene and replace the Lucene jar that comes with elasticsearch, or create
your own custom analyzer and register it with elasticsearch.

