Importing language analyzers

Hey!

Is it possible to import language analyzers from Lucene, since ES is
built on top of it (and Lucene definitely supports more languages
out of the box)?
The list of languages supported by ES is pretty extensive too, but it
lacks Polish, which I need :)
ES seems much more flexible and has great potential (and I would like
to use it in a project I'm developing), but without support for a
given language it just won't do.
Also, who adds language support to ES: its developers or the
community?

Cheers,
Pawel

Yes, you can hook in your own analyzer, but you will need to implement a custom
class that provides it. See, for example, the GermanAnalyzerProvider. What
is the name of the Polish analyzer? I might have missed it and not
included it out of the box.
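
As a rough illustration of the provider idea described above (a custom class whose only job is to supply an analyzer under a name), here is a minimal, self-contained Java sketch. Every type in it (`SketchAnalyzer`, `SketchAnalyzerProvider`, `ToyPolishAnalyzerProvider`, `SketchAnalyzerRegistry`) is a hypothetical stand-in invented for this example. None of them are Elasticsearch or Lucene classes, so check the real GermanAnalyzerProvider for the actual base class and constructor. The toy analyzer only lowercases and splits on whitespace; it does no Polish stemming.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an analyzer: turns text into tokens.
// This is NOT Lucene's Analyzer class.
interface SketchAnalyzer {
    List<String> analyze(String text);
}

// Hypothetical stand-in for the provider role: it names an analyzer
// and hands out the instance. This is NOT the Elasticsearch interface.
interface SketchAnalyzerProvider {
    String name();
    SketchAnalyzer get();
}

// Toy analyzer: lowercases the input and splits it on whitespace.
final class ToyPolishAnalyzer implements SketchAnalyzer {
    @Override
    public List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}

// The "custom class that provides it": wraps one analyzer instance
// and exposes it under the name "polish" (a name assumed for this sketch).
final class ToyPolishAnalyzerProvider implements SketchAnalyzerProvider {
    private final SketchAnalyzer analyzer = new ToyPolishAnalyzer();

    @Override
    public String name() {
        return "polish";
    }

    @Override
    public SketchAnalyzer get() {
        return analyzer;
    }
}

// Minimal registry that looks analyzers up by provider name.
final class SketchAnalyzerRegistry {
    private final Map<String, SketchAnalyzerProvider> providers = new HashMap<>();

    void register(SketchAnalyzerProvider provider) {
        providers.put(provider.name(), provider);
    }

    SketchAnalyzer analyzer(String name) {
        SketchAnalyzerProvider provider = providers.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("No analyzer registered under: " + name);
        }
        return provider.get();
    }
}

final class ProviderSketchDemo {
    public static void main(String[] args) {
        SketchAnalyzerRegistry registry = new SketchAnalyzerRegistry();
        registry.register(new ToyPolishAnalyzerProvider());
        System.out.println(registry.analyzer("polish").analyze("Ala ma KOTA"));
    }
}
```

The split between analyzer and provider is the point of the sketch: the registry only needs to know the provider's name, so adding a language means adding one provider class rather than changing the lookup code.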

From what I understand, it's called Stempel and it's included in
Lucene.

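I think we should review all the available analyzers and identify
the missing ones (i.e. those not wrapped in ES).

I also found this one on the Internet for Chinese:
http://code.google.com/p/ik-analyzer/
With an ES plugin (at least a stub):
https://github.com/medcl/elasticsearch/blob/21abad12a0096173e8836dd042ca403751ab7ad1/plugins/analysis/ik/src/main/java/org/elasticsearch/index/analysis/IkAnalyzer.java
But it's not part of Lucene-contrib.

Here is what I found that is part of Lucene-contrib (apparently only these few
*language* analyzers are missing):

Index of /__root/docs.lucene.apache.org/core/3_3_0/api/contrib-analyzers/org/apache...

Some other findings:

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/tartaru...

http://lucene.apache.org/java/3_3_0/api/contrib-analyzers/org/tartaru...

  • Wikipedia-syntax-aware tokenizer:

Index of /__root/docs.lucene.apache.org/core/3_3_0/api/contrib-analyzers/org/apache...

The most interesting lists come from Solr itself:

  • http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
  • LanguageAnalysis - Solr - Apache Software Foundation

I didn't look very thoroughly at those last two links, but it looks like we
may be missing:

  • ClassicTokenizer (may be deprecated or superseded by the
    StandardTokenizer, I have no idea):
    https://builds.apache.org/job/Lucene-3.x/javadoc/all/org/apache/lucen...
  • CommonGrams:
    http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGram...
  • Lao, Myanmar, Khmer (these seem to only split into syllables):
    LanguageAnalysis - Solr - Apache Software Foundation

Towards a small easy pull-request?

[1] http://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html
    (The Kraaij-Pohlmann stemming algorithm), seen from
    LanguageAnalysis - Solr - Apache Software Foundation
[2] Stemming - Wikipedia

Olivier Favre

www.yakaz.com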

Heya,

Yeah, open issues for the missing analyzers. The Stempel one, for example,
should be simple to add (it's in a different lib).

So will the next release include them out of the box? I'm in no
hurry and I'd rather wait until someone does it properly.

Cheers,
Pawel

Hey, I just wrote a post a few days ago about how to write a custom ES
plugin, but it's in Chinese :(

http://log.medcl.net/item/2011/07/diving-into-elasticsearch-3-编写自定义分词插件/

Sure, the next release can include them; just make sure to open issues for
the relevant analyzers.
