Extending Thai analyzer

Min_Cha · February 7, 2014, 5:59am

Hi folks.

I would like to develop a search system for the Thai language.
First of all, I found the Thai analyzer, and it seems good.

However, it doesn't meet all of my requirements,
so I have decided to extend it.
For example, I would like to add an nGram token filter to the Thai analyzer
without changing the analyzer itself.

How can I do this?
Please give me some advice.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5041f397-8732-413f-8e50-46e25610c639%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

nik9000 · February 7, 2014, 2:17pm

If you don't like the language analyzer you have to rebuild it as a custom
analyzer then add what you need to it.

{
  "analyzer": {
    "thai_with_ngram": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["standard", "lowercase", "thai", "thai_stop", "ngram"]
    }
  },
  "filter": {
    "thai": {
      "type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
    },
    "thai_stop": {
      "type": "stop",
      "stopwords_path": "org/apache/lucene/analysis/th/stopwords.txt"
    },
    "ngram": { your ngram configuration here }
  }
}

That should build it with your ngram configuration, I think. I'm making quite
a few educated guesses here, so I expect you'll have to fiddle with it to get
it right.

How I did this:

1. Open the class called ThaiAnalyzer in the Lucene version Elasticsearch
   is using and find the method called createComponents. For me this is
   simple because I have Elasticsearch open in Eclipse.
2. That method defines the tokenizer (standard) and some filters
   (standard, lowercase, ThaiWordFilter, and stop). You have to be able to
   translate the class names to Elasticsearch's easier names to get this to
   work properly.
3. Now build it as a custom analyzer with your extra filter in there. That
   is "thai_with_ngram" above.
4. Next you'll need to define all the filters that don't exist by default
   in Elasticsearch. In this case that is thai, thai_stop, and your ngram
   filter. In order:
   - The thai filter doesn't have an easy Elasticsearch name, so you have
     to tell Elasticsearch the class name to load. That class doesn't take
     any configuration, so we're done.
   - The thai_stop filter is just a regular stop word filter with Thai stop
     words. But Elasticsearch doesn't have an easy name to reference the Thai
     stop words file. That isn't too bad, as you can load the stopwords file
     from the classpath. It lives in Lucene at the path I added above.
   - The ngram filter is yours to build, but it is well documented.
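
To make the last step more concrete, here is a minimal sketch of what an n-gram filter does to a single token. It is plain Java, not Lucene or Elasticsearch code: the class name NGramSketch and the method ngrams are made up for illustration, and the order of the output is an arbitrary choice that may differ from what Lucene emits. The minGram and maxGram parameters stand in for the min_gram and max_gram settings of the ngram filter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of Lucene or Elasticsearch.
public class NGramSketch {

    /** Returns every substring of token whose length is between minGram and maxGram. */
    public static List<String> ngrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        // For each start offset, emit the grams of every allowed length that still fit.
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [th, tha, ha, hai, ai]
        System.out.println(ngrams("thai", 2, 3));
    }
}
```

The sketch slices by Java char units, which is a simplification. In the analyzer chain above, the filter would see the tokens produced by the earlier filters, one at a time.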

That took longer than I expected, but it was worth the exercise, so I'll
remember how to do it again when I need it. For reference, I do it for
English, which has more filters, but they all have easy names.

Nik

Min_Cha · February 14, 2014, 2:06am

Hello Nik.
Thanks for your advice.

I just tried what you advised, but I got the following error.

"error": "IndexCreationException[[search] failed to create index]; nested:
CreationException[Guice creation errors:\n\n1) Could not find a suitable
constructor in org.apache.lucene.analysis.th.ThaiWordFilterFactory. Classes
must have either one (and only one) constructor annotated with @Inject or a
zero-argument constructor that is not private.\n at
org.apache.lucene.analysis.th.ThaiWordFilterFactory.class(Unknown Source)\n
at
org.elasticsearch.index.analysis.TokenFilterFactoryFactory.create(Unknown
Source)\n at
org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown
Source)\n at unknown\n\n1 error]; ",

In my opinion, this error is raised because ThaiWordFilterFactory doesn't have
a zero-argument constructor. In fact, ThaiWordFilterFactory has only the
following constructor.

/** Creates a new ThaiWordFilterFactory */
public ThaiWordFilterFactory(Map<String,String> args) {
  super(args);
  assureMatchVersion();
  if (!args.isEmpty()) {
    throw new IllegalArgumentException("Unknown parameters: " + args);
  }
}
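
The condition named in the error message can be checked with plain reflection. The following is a minimal sketch under stated assumptions: MapOnlyFactory is a hypothetical stand-in for a factory whose only constructor takes a Map, and hasZeroArgConstructor is a made-up helper. It only looks for a constructor with no parameters; it does not look at @Inject annotations or visibility, and it is not how Guice itself performs the check.

```java
import java.lang.reflect.Constructor;
import java.util.Collections;
import java.util.Map;

// Illustrative only: these names are hypothetical, not Lucene, Guice, or Elasticsearch APIs.
public class ConstructorCheck {

    /** Stand-in for a factory whose only constructor takes a Map of arguments. */
    public static class MapOnlyFactory {
        final Map<String, String> args;

        public MapOnlyFactory(Map<String, String> args) {
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
            this.args = args;
        }
    }

    /** Returns true if the class declares a constructor that takes no parameters. */
    public static boolean hasZeroArgConstructor(Class<?> type) {
        for (Constructor<?> constructor : type.getDeclaredConstructors()) {
            if (constructor.getParameterCount() == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The Map-only class has no zero-argument constructor, so this prints false.
        System.out.println(hasZeroArgConstructor(MapOnlyFactory.class));
        // Constructing it by hand with an empty map works fine.
        new MapOnlyFactory(Collections.<String, String>emptyMap());
    }
}
```

A class shaped like this can still be constructed by code that passes the map explicitly, but not by a caller that expects a zero-argument constructor.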

If you don't mind, I have one more question: can I define a constructor
argument in the settings JSON above?

spinscale · February 14, 2014, 9:55am

Hey,

The standard thai analyzer supports a stopwords_path in its settings, so there
is no need to reference ThaiWordFilterFactory.
That should help you.

--Alex

Min_Cha · February 14, 2014, 10:48am

Thanks.

If you don't mind, could you give me a specific example or explain in more
detail?
I can't understand your advice.

