Indexing non-English text

Andrei · August 24, 2010, 6:16pm

I have two questions related to indexing non-English text.

Does ES support accented character folding, i.e. indexing "café",
but if the search term is "cafe" the doc is still found?
If I understand correctly, the analyzers only support English text,
so indexing Russian, German, etc won't work?

-Andrei

kimchy · August 24, 2010, 11:45pm

On Tue, Aug 24, 2010 at 9:16 PM, Andrei andrei@zmievski.org wrote:

I have two questions that related to indexing non-English text.

Does ES support accented character folding, i.e. indexing "café",
but if the search term is "cafe" the doc is still found?

Yes, you can create your own analyzer and add to it the asciifolding filter.
The ICU plugin might also be interesting for this.
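
To illustrate what accent folding means in practice, here is a minimal Python sketch. It is not the asciifolding filter itself: it only approximates folding by decomposing characters and dropping combining marks, and the names fold_accents and analyze are made up for this example.

```python
import unicodedata


def fold_accents(text):
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Toy analyzer: split on whitespace, fold accents, lowercase."""
    return [fold_accents(token).lower() for token in text.split()]


# Index time and search time run the same analysis, so the terms line up.
indexed_terms = set(analyze("Un café noir"))
query_terms = analyze("cafe")
matches = all(term in indexed_terms for term in query_terms)
print(matches)
```

Because the same folding is applied to both the document and the query, "cafe" and "café" end up as the same term. The real filter covers more characters than this sketch does.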

If I understand correctly, the analyzers only support English text,
so indexing Russian, German, etc won't work?

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.

James_Cook · August 25, 2010, 12:33pm

We have to search text where Arabic and English are both used. I don't
foresee fields where Arabic and English are contained in the same document,
but we will definitely have many Arabic and English documents in our index.

Can someone provide configuration options for this scenario?

Andrei · August 25, 2010, 6:13pm

On Aug 24, 4:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Yes, you can create your own analyzer and add to it the asciifolding filter.
The ICU plugin might also be interesting for this.

Do you mean to create one in Java or in the configuration file?

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.

Could you link to the page that you updated? I couldn't find the
references to non-English languages there.

-Andrei

kimchy · August 25, 2010, 6:57pm

On Wed, Aug 25, 2010 at 9:13 PM, Andrei andrei@zmievski.org wrote:

On Aug 24, 4:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Yes, you can create your own analyzer and add to it the asciifolding filter.
The ICU plugin might also be interesting for this.

Do you mean to create one in Java or in the configuration file?

It's in a configuration file. You create a custom analyzer that includes it.
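
As a rough sketch of what such a configuration could look like, the Python below builds the analysis settings as a dictionary and prints them as JSON. The analyzer name "folding" is hypothetical, and the exact key layout is an assumption that may differ between Elasticsearch versions, so check it against the docs for the release you run.

```python
import json


def folding_analyzer_settings(analyzer_name="folding"):
    """Settings for a custom analyzer that lowercases and folds accents.

    analyzer_name is a made-up label; the tokenizer and filter names are the
    built-in "standard", "lowercase" and "asciifolding" components.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            }
        }
    }


settings = folding_analyzer_settings()
print(json.dumps(settings, indent=2))
```

The resulting JSON would be supplied when the index is created, and the analyzer is then referenced by name from the field mapping.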

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.

Could you link to the page that you updated? I couldn't find the
references to non-English languages there.

Here it is:
http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/analyzer/lang/
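
For illustration only, here is a small Python sketch that assigns a language analyzer to each field of a mapping. The field names title_ru and title_de are invented, and the "text" field type is an assumption based on recent releases (older releases used "string"); "russian" and "german" are built-in language analyzer names.

```python
import json


def language_field_mapping(fields, field_type="text"):
    """Build a mapping 'properties' fragment from {field_name: analyzer_name}."""
    return {
        name: {"type": field_type, "analyzer": analyzer}
        for name, analyzer in fields.items()
    }


# Hypothetical fields, one per language.
properties = language_field_mapping({
    "title_ru": "russian",
    "title_de": "german",
})
print(json.dumps({"properties": properties}, indent=2, sort_keys=True))
```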

James_Cook · August 26, 2010, 1:31pm

Circling around to my earlier question, can I have an English and Arabic
analyzer specified on the same fields across documents?

kimchy · August 27, 2010, 10:33am

No, you can't specify different analyzers on the same field.

Clinton_Gormley · August 27, 2010, 11:34am

On Fri, 2010-08-27 at 13:33 +0300, Shay Banon wrote:

No, you can't specify different analyzers on the same field.

But you can index the same field twice, as a multi field, with different
analysers:

http://www.elasticsearch.com/docs/elasticsearch/mapping/multi_field/
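
A minimal sketch of the idea, assuming the newer "fields" syntax that later replaced the multi_field type linked above. The field name content and the sub-field name ar are hypothetical, and the "text" type is an assumption based on recent releases.

```python
import json


def multi_language_field(default_analyzer, extra_analyzers, field_type="text"):
    """Map one field with a default analyzer plus one sub-field per extra analyzer."""
    field = {"type": field_type, "analyzer": default_analyzer, "fields": {}}
    for sub_name, analyzer in extra_analyzers.items():
        field["fields"][sub_name] = {"type": field_type, "analyzer": analyzer}
    return field


# 'content' is analyzed as English; 'content.ar' holds the same text analyzed as Arabic.
content_mapping = {
    "properties": {
        "content": multi_language_field("english", {"ar": "arabic"}),
    }
}
print(json.dumps(content_mapping, indent=2))
```

The same source text is indexed twice, so the index grows, but each copy goes through its own analyzer.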

clint

--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.

James_Cook · August 27, 2010, 12:57pm

In my example, I have an object I am indexing which is similar to a
discussion thread. The 'content' property will contain text which may be in
English or Arabic.

If the JSON document I am indexing can determine which language it is using,
can an analyzer be chosen at index and search time?

I don't know much about mappings yet, but the multi-field approach worries me
because the 'content' field will be knowingly indexed once with the correct
analyzer and once with the incorrect analyzer.

It appears from the doc entry that the query is then performed only against
the 'default' entry in the multi-field instead of being applied against all
multi-field entries. This makes it a bit harder to manage queries, I think. If
multi-field is the only way to be able to search for multilingual text in a
field, I suppose I will have to adapt.
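
One way the query side could be handled is sketched below, assuming a 'content' field with an 'ar' sub-field analyzed for Arabic (both names are hypothetical). It uses the multi_match query from later Elasticsearch releases; nothing in this thread confirms it was available at the time, so treat it as an assumption.

```python
import json


def search_all_variants(text, field="content", sub_fields=("ar",)):
    """Build a query body that targets the default field and every sub-field."""
    targets = [field] + ["%s.%s" % (field, sub) for sub in sub_fields]
    return {"query": {"multi_match": {"query": text, "fields": targets}}}


query = search_all_variants("discussion")
print(json.dumps(query, indent=2))
```

With a helper like this the list of sub-fields lives in one place, so adding a language does not mean rewriting every query by hand.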

A quick search shows there are some analyzers out there that have been
developed for this problem (e.g. Sematext's Multilingual Indexer). In the
docs there is a list of built-in analyzers. Is it straightforward to include
and configure other analyzers? Any pointers to docs?

Thanks

otisg · September 9, 2010, 11:47pm

Hello,

I spotted this reference to Sematext's Multilingual Indexer (MI):

A quick search shows there are some analyzers out there that have been
developed for this problem (e.g. Sematext's Multilingual Indexer). In the
docs there is a list of built-in analyzers. Is it straightforward to include
and configure other analyzers? Any pointers to docs?

Not sure if you are asking for MI docs or some other docs. MI comes
with good docs, but they are not public. Adding it to Solr is well
documented and easy to do. Let us know if you need it for
Elasticsearch.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

James_Cook · September 10, 2010, 5:13pm

Hi,

We are definitely in need of a way to search a single field for content that
can be in a variety of languages.

If that requires a product that needs to be licensed, then I am willing to
go down that road.

Please feel free to contact me at jcook at tracermedia dot c-o-m.
