Extending Thai analyzer


(Min Cha) #1

Hi folks.

I would like to build a search system for the Thai language.
I found the Thai analyzer, and at first glance it looks good.

However, it doesn't meet all of my requirements, so I decided to extend it.
For example, I would like to add an nGram token filter on top of the Thai analyzer
without changing the analyzer itself.

How can I do this?
Please give me some advice.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5041f397-8732-413f-8e50-46e25610c639%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Nik Everett) #2

If you don't like the language analyzer you have to rebuild it as a custom
analyzer then add what you need to it.

{
  "analyzer": {
    "thai_with_ngram": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["standard", "lowercase", "thai", "thai_stop", "ngram"]
    }
  },
  "filter": {
    "thai": {
      "type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
    },
    "thai_stop": {
      "type": "stop",
      "stopwords_path": "org/apache/lucene/analysis/th/stopwords.txt"
    },
    "ngram": { your ngram configuration here }
  }
}

That builds it with your ngram configuration, I think. I'm taking quite a few
educated guesses here, so expect to have to fiddle with it to get it
right.

How I did this:

  1. Open the class called ThaiAnalyzer in the Lucene version Elasticsearch
    is using and find the method called createComponents. For me this is
    simple because I have Elasticsearch open in Eclipse.
  2. That method defines the tokenizer (standard) and some filters
    (standard, lowercase, ThaiWordFilter, and stop). You have to be able to
    translate the class names to Elasticsearch's easier names to get this to
    work properly.
  3. Now build it as a custom analyzer with your extra filter in there. That
    is "thai_with_ngram" above.
  4. Next you'll need to define all the filters that don't exist by default
    in Elasticsearch. In this case that is thai, thai_stop, and your ngram
    filter. In order:
  5. The thai filter doesn't have an easy Elasticsearch name, so you have
    to tell Elasticsearch the class name to load. That class doesn't take any
    configuration, so we're done.
  6. The thai_stop filter is just a regular stop word filter with Thai stop
    words. But Elasticsearch doesn't have an easy name to reference the Thai
    stop words file. That isn't too bad, as you can load the stopwords file
    from the classpath. It lives in Lucene at the path I added above.
  7. The ngram filter is yours to build, but it is well documented.
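
As a rough sketch of step 7, the placeholder "ngram" entry in the filter section of the settings above could be filled in along these lines. The min_gram and max_gram values are illustrative assumptions, not values taken from this thread, so pick sizes that suit your data.

```json
{
  "filter": {
    "ngram": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 3
    }
  }
}
```

With these example values, each token that reaches the filter would be split into every 2-character and 3-character substring, which grows the index noticeably, so small gram ranges are usually a sensible starting point.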

That took longer than I expected, but it was worth the exercise so I'll
remember how to do it again when I need it. For reference, I do it for
English, which has more filters, but they all have easy names.

Nik

(In reply to Min Cha's message of Fri, Feb 7, 2014, 12:59 AM.)



(Min Cha) #3

Hello Nik.
Thanks for your advice.

I tried what you suggested, but I got the following error.

"error": "IndexCreationException[[search] failed to create index]; nested:
CreationException[Guice creation errors:\n\n1) Could not find a suitable
constructor in org.apache.lucene.analysis.th.ThaiWordFilterFactory. Classes
must have either one (and only one) constructor annotated with @Inject or a
zero-argument constructor that is not private.\n at
org.apache.lucene.analysis.th.ThaiWordFilterFactory.class(Unknown Source)\n
at
org.elasticsearch.index.analysis.TokenFilterFactoryFactory.create(Unknown
Source)\n at
org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown
Source)\n at unknown\n\n1 error]; ",

In my opinion, this error is raised because ThaiWordFilterFactory doesn't have a
zero-argument constructor. In fact, ThaiWordFilterFactory has only the
following constructor.

/** Creates a new ThaiWordFilterFactory */
public ThaiWordFilterFactory(Map<String,String> args) {
  super(args);
  assureMatchVersion();
  if (!args.isEmpty()) {
    throw new IllegalArgumentException("Unknown parameters: " + args);
  }
}

If you don't mind, I have one more question: can I define a constructor
argument in the settings JSON above?
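
One possible way around the constructor problem is to avoid naming the Lucene factory class at all. The sketch below assumes a later Elasticsearch release than the one used in this thread, one in which a "thai" tokenizer and a predefined "_thai_" stop word list are available; the reference documentation for those later releases describes rebuilding the thai analyzer in this way. The names thai_with_ngram, thai_stop, and thai_ngram, as well as the gram sizes, are illustrative assumptions.

```json
{
  "analysis": {
    "filter": {
      "thai_stop": {
        "type": "stop",
        "stopwords": "_thai_"
      },
      "thai_ngram": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 3
      }
    },
    "analyzer": {
      "thai_with_ngram": {
        "type": "custom",
        "tokenizer": "thai",
        "filter": ["lowercase", "decimal_digit", "thai_stop", "thai_ngram"]
      }
    }
  }
}
```

Here word segmentation happens in the tokenizer rather than in a ThaiWordFilter, so no fully qualified class name appears in the settings. Whether this applies depends on the Elasticsearch version in use.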

(In reply to Nikolas Everett's message of Friday, February 7, 2014, 11:17:59 PM UTC+9.)



(Alexander Reelsen) #4

Hey,

The standard Thai analyzer supports a stopwords_path setting in the mapping, so
there is no need to reference ThaiWordFilterFactory.
That should help you.

--Alex
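
A minimal sketch of what this suggestion might look like in the index settings, assuming the built-in analyzer type is named "thai". The analyzer name my_thai and the file path stopwords/thai.txt are hypothetical and only there for illustration; the path is assumed to be resolved relative to the Elasticsearch config directory.

```json
{
  "analysis": {
    "analyzer": {
      "my_thai": {
        "type": "thai",
        "stopwords_path": "stopwords/thai.txt"
      }
    }
  }
}
```

Note that this only swaps the stop word list. As far as I know, a built-in analyzer type does not accept an extra filter list, so adding an ngram filter would still require a custom analyzer.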

(In reply to Min Cha's message of Fri, Feb 14, 2014, 3:06 AM.)



(Min Cha) #5

Thanks.

If you don't mind, could you give me a specific example or explain in more
detail?
I can't understand your advice.

(In reply to Alexander Reelsen's message of Friday, February 14, 2014, 6:55:44 PM UTC+9.)


