Indexing non-English text


(Andrei) #1

I have two questions related to indexing non-English text.

  1. Does ES support accented character folding, i.e. indexing "café",
    but if the search term is "cafe" the doc is still found?

  2. If I understand correctly, the analyzers only support English text,
    so indexing Russian, German, etc won't work?

-Andrei


(Shay Banon) #2

On Tue, Aug 24, 2010 at 9:16 PM, Andrei andrei@zmievski.org wrote:

I have two questions related to indexing non-English text.

  1. Does ES support accented character folding, i.e. indexing "café",
    but if the search term is "cafe" the doc is still found?

Yes, you can create your own analyzer and add to it the asciifolding filter.
The ICU plugin might also be interesting for this.

  2. If I understand correctly, the analyzers only support English text,
    so indexing Russian, German, etc won't work?

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.



(James Cook) #3

We have to search text where Arabic and English are both used. I don't
foresee fields where Arabic and English are contained in the same document,
but we will definitely have many Arabic and English documents in our index.

Can someone provide configuration options for this scenario?



(Andrei) #4

On Aug 24, 4:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Yes, you can create your own analyzer and add to it the asciifolding filter.
The ICU plugin might also be interesting for this.

Do you mean to create one in Java or in the configuration file?

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.

Could you link to the page that you updated? I couldn't find the
references to non-English languages there.

-Andrei


(Shay Banon) #5

On Wed, Aug 25, 2010 at 9:13 PM, Andrei andrei@zmievski.org wrote:

On Aug 24, 4:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Yes, you can create your own analyzer and add to it the asciifolding
filter.
The ICU plugin might also be interesting for this.

Do you mean to create one in Java or in the configuration file?

It's in a configuration file. You create a custom analyzer that includes it.
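
For readers following along, here is a minimal sketch of what such a custom analyzer definition might look like, written as a Python dict that serializes to the JSON settings body. The analyzer name "folding" and the choice of the "standard" tokenizer and "lowercase" filter are assumptions for illustration; only the "asciifolding" filter comes from the thread. The fold() helper is not Elasticsearch code, just a rough local approximation of accent folding so the effect on "café" can be seen without a running node.

```python
import json
import unicodedata

# Assumed layout of index analysis settings. "folding" is a hypothetical
# analyzer name; "asciifolding" is the filter mentioned in the thread.
INDEX_SETTINGS = {
    "index": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def fold(text: str) -> str:
    """Approximate accent folding: decompose characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Print the settings body that would be sent when creating the index.
    print(json.dumps(INDEX_SETTINGS, indent=2))
    # Show that the accented and unaccented forms fold to the same token.
    print(fold("café") == fold("cafe"))
```

Because the same analyzer runs at index time and at search time, both "café" and "cafe" end up as the same term, which is why either spelling finds the document.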

It depends how far you want to take it. There are specific analyzers for
different languages. I updated the docs to reflect that.

Could you link to the page that you updated? I couldn't find the
references to non-English languages there.

Here it is:
http://www.elasticsearch.com/docs/elasticsearch/index_modules/analysis/analyzer/lang/



(James Cook) #6

Circling around to my earlier question, can I have an English and Arabic
analyzer specified on the same fields across documents?



(Shay Banon) #7

No, you can't specify different analyzers on the same field.

On Thu, Aug 26, 2010 at 4:31 PM, James Cook jcook@tracermedia.com wrote:

Circling around to my earlier question, can I have an English and Arabic
analyzer specified on the same fields across documents?



(Clinton Gormley) #8

On Fri, 2010-08-27 at 13:33 +0300, Shay Banon wrote:

No, you can't specify different analyzers on the same field.

But you can index the same field twice, as a multi field, with different
analysers:

http://www.elasticsearch.com/docs/elasticsearch/mapping/multi_field/
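
As a rough illustration of the multi field idea, the sketch below builds a mapping in which one source field is indexed twice, once per analyzer. It assumes the multi_field mapping type from the linked docs page and the "string" field type of that era; the sub-field name "arabic" and the analyzer names "standard" and "arabic" are illustrative assumptions, so check them against the language analyzer list for your version.

```python
import json


def multi_field_mapping(field, default_analyzer, extra_analyzers):
    """Build a multi_field mapping for one field.

    The default sub-field shares the field's own name; each entry in
    extra_analyzers adds another sub-field indexed with that analyzer.
    """
    fields = {field: {"type": "string", "analyzer": default_analyzer}}
    for sub_name, analyzer in extra_analyzers.items():
        fields[sub_name] = {"type": "string", "analyzer": analyzer}
    return {field: {"type": "multi_field", "fields": fields}}


# Hypothetical mapping for a 'content' field holding English or Arabic text.
MAPPING = {
    "properties": multi_field_mapping("content", "standard", {"arabic": "arabic"})
}

if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
```

The trade-off is index size: every document's text is analyzed and stored once per sub-field, including with the analyzer that does not match its language.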

clint


--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.


(James Cook) #9

In my example, I have an object I am indexing which is similar to a
discussion thread. The 'content' property will contain text which may be in
English or Arabic.

If the JSON document I am indexing can determine which language it is using,
can an analyzer be chosen at index and search time?

I don't know much about mappings yet, but the multi field approach worries me
because the 'content' field will be knowingly indexed once with the correct
analyzer and once with the incorrect analyzer.

It appears from the doc entry that the query is then performed only against
the 'default' entry in the multi field instead of against all of its
entries. I think this makes queries a bit harder to manage. If multi field
is the only way to search multilingual text in a field, I suppose I will
have to adapt. :)
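
One possible way to query every entry rather than only the default one is to OR together one clause per sub-field. The sketch below is an assumption, not something confirmed in this thread: it supposes that a sub-field can be addressed as "content.arabic" and that a bool query with should clauses wrapping query_string is available in your version.

```python
import json


def search_all_variants(text, field_paths):
    """Build a bool query with one query_string clause per field path.

    A document matches if any clause matches, so the text is tried
    against each analyzed variant of the field.
    """
    clauses = [
        {"query_string": {"default_field": path, "query": text}}
        for path in field_paths
    ]
    return {"query": {"bool": {"should": clauses}}}


if __name__ == "__main__":
    # Hypothetical field paths: the default entry and the Arabic sub-field.
    body = search_all_variants("cafe", ["content", "content.arabic"])
    print(json.dumps(body, indent=2))
```

Each clause is analyzed with the analyzer of the field it targets, so the caller does not need to know the language of the search text in advance.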

A quick search shows there are some analyzers out there that have been
developed for this problem (e.g.
http://www.sematext.com/products/multilingual-indexer/index.html). The
docs list the built-in analyzers. Is it straightforward to include
and configure other analyzers? Any pointers to docs?

Thanks



(Otis Gospodnetić) #10

Hello,

I spotted this reference to Sematext's Multilingual Indexer (MI):

A quick search shows there are some analyzers out there that have been
developed for this problem. (i.e.
http://www.sematext.com/products/multilingual-indexer/index.html) In the
docs there is a list of built in analyzers. Is it straightforward to include
and configure other analyzers? Any pointers to docs?

Not sure if you are asking for MI docs or some other docs. MI comes
with good docs, but they are not public. Adding it to Solr is well
documented and easy to do. Let us know if you need it for
Elasticsearch.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



(James Cook) #11

Hi,

We are definitely in need of a way to search a single field for content that
can be in a variety of languages.

If that requires a product that needs to be licensed, then I am willing to
go down that road.

Please feel free to contact me at jcook at tracermedia dot c-o-m.


