Importing language analyzers


(Paweł Konieczny) #1

Hey!

Is it possible to import language analyzers from Lucene, since ES is
built on top of it (and Lucene definitely supports more languages
out of the box)?
The list of languages supported by ES is pretty extensive too, but it
lacks Polish, which I need :slight_smile:
ES seems much more flexible and has great potential (and I would like
to use it in a project I'm developing), but without support for a
given language it just won't do.
Also, who adds language support to ES: its developers or the
community?

Cheers,
Pawel


(Shay Banon) #2

Yes, you can hook in your own analyzer, but you will need to implement a custom
class that provides it. Check, for example, the GermanAnalyzerProvider. What
is the name of the Polish analyzer? I might have missed it and not
included it out of the box.
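
To give a rough idea of what "a custom class that provides it" means, here is a minimal, self-contained sketch of the provider pattern. It does not use the real Elasticsearch or Lucene classes: `SimpleAnalyzer`, `SimpleAnalyzerProvider`, `ToyPolishAnalyzer` and `AnalyzerRegistry` are all hypothetical names made up for illustration, and the toy analyzer only lowercases and splits on whitespace. For the actual base class and constructor arguments, look at GermanAnalyzerProvider in the ES source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an analyzer: turns text into a list of tokens.
interface SimpleAnalyzer {
    List<String> analyze(String text);
}

// Hypothetical stand-in for a provider: a named factory for one analyzer.
interface SimpleAnalyzerProvider {
    String name();
    SimpleAnalyzer get();
}

// Toy "Polish" analyzer: lowercases and splits on whitespace, no stemming.
// A real provider would wrap an existing analyzer implementation instead.
final class ToyPolishAnalyzer implements SimpleAnalyzer {
    private static final Locale POLISH = Locale.forLanguageTag("pl");

    @Override
    public List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(POLISH));
            }
        }
        return tokens;
    }
}

// The provider exposes the analyzer under the name used in configuration.
final class ToyPolishAnalyzerProvider implements SimpleAnalyzerProvider {
    private final SimpleAnalyzer analyzer = new ToyPolishAnalyzer();

    @Override
    public String name() {
        return "polish";
    }

    @Override
    public SimpleAnalyzer get() {
        return analyzer;
    }
}

// Hypothetical registry: looks analyzers up by name via their providers.
final class AnalyzerRegistry {
    private final Map<String, SimpleAnalyzerProvider> providers = new HashMap<>();

    void register(SimpleAnalyzerProvider provider) {
        providers.put(provider.name(), provider);
    }

    SimpleAnalyzer analyzer(String name) {
        SimpleAnalyzerProvider provider = providers.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("no analyzer registered under: " + name);
        }
        return provider.get();
    }
}

class AnalyzerProviderSketch {
    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry();
        registry.register(new ToyPolishAnalyzerProvider());
        // Look the analyzer up by name, as an index configuration would.
        System.out.println(registry.analyzer("polish").analyze("Ala ma KOTA"));
    }
}
```

The point of the indirection is that the rest of the system only needs a name and a provider; which analyzer class sits behind it can be swapped without touching the lookup code.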



(Paweł Konieczny) #3

From what I understand, it's called Stempel and it's included in
Lucene.



(ofavre) #4

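I think we should review all the available analyzers, and identify
the missing ones (i.e. not wrapped in ES).

I also found this one on the Internet for Chinese:
http://code.google.com/p/ik-analyzer/
With an ES plugin (at least a stub):
https://github.com/medcl/elasticsearch/blob/21abad12a0096173e8836dd042ca403751ab7ad1/plugins/analysis/ik/src/main/java/org/elasticsearch/index/analysis/IkAnalyzer.java
But it's not part of Lucene-contrib.

Here is what I found, part of Lucene-contrib (apparently only those few
*language* analyzers are missing):

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/apache/...

Some other findings:

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/tartaru...

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/tartaru...

  • Wikipedia-syntax-aware tokenizer:

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/apache/...

The most interesting lists come from Solr itself:

  • http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
  • http://wiki.apache.org/solr/LanguageAnalysis

I didn't look very thoroughly at those last two links, but it looks like we
may be missing:

  • ClassicTokenizer (may be deprecated or superseded by the
    StandardTokenizer, I have no idea):

https://builds.apache.org/job/Lucene-3.x/javadoc/all/org/apache/lucen...

  • CommonGrams:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGram...

  • Lao, Myanmar, Khmer - seem to only split in syllables:

http://wiki.apache.org/solr/LanguageAnalysis#Lao.2C_Myanmar.2C_Khmer

Towards a small easy pull-request?

[1] http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html
     Seen from http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.SnowballPorterFilterFactory
[2] http://en.wikipedia.org/wiki/Stemming#History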

Olivier Favre



(Shay Banon) #5

Heya,

Yeah, open issues for the missing analyzers. The Stempel one, for example,
should be simple to add (it's in a different lib).



(Paweł Konieczny) #6

So will the next release include them out of the box? I'm in no
hurry and I'd rather wait until someone does it properly.

Cheers,
Pawel



(medcl.net) #7

Hey, I wrote a post a few days ago about how to write a custom ES plugin,
but it's in Chinese~ :frowning:

http://log.medcl.net/item/2011/07/diving-into-elasticsearch-3-编写自定义分词插件/



(Shay Banon) #8

It can have it, sure, just make sure to open issues for the relevant
analyzers.


