Use case of multiple language analyzers and Hunspell along with the Elasticsearch Langdetect Plugin

Hi All,

We have an ES cluster that indexes a large amount of data in different languages. As of now our settings point to the English analyzer and the English Hunspell dictionary. How can we index multilingual data, with a multilingual analyzer and Hunspell setup, in the same index? (I came across a plugin called the Elasticsearch Langdetect Plugin (https://github.com/jprante/elasticsearch-langdetect), available from ES 1.2.1.)

Our current analyzer settings look like this:

index :
  analysis :
    analyzer :
      synonym :
        tokenizer : whitespace
        filter : [synonym]
      default_index :
        type : custom
        tokenizer : whitespace
        filter : [standard, lowercase, hunspell_US]
      default_search :
        type : custom
        tokenizer : whitespace
        filter : [standard, lowercase, synonym, hunspell_US]
    filter :
      synonym :
        type : synonym
        ignore_case : true
        expand : true
        synonyms_path : synonyms.txt
      hunspell_US :
        type : hunspell
        locale : en_US
        dedup : false
        ignore_case : true
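As a sanity check, the `_analyze` API can show what these chains emit (a sketch; `my_index` is a placeholder index name, not from this thread):

```
GET /my_index/_analyze?analyzer=default_index&text=Running quickly
```

If the Hunspell filter is wired up correctly, the returned tokens should be lowercased and stemmed by the en_US dictionary.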

So here:

  1. Can we configure multilingual analyzers and Hunspell dictionaries for the same index, and then index data by configuring the langdetect plugin for specific fields? Will the data then be indexed and analyzed by the corresponding language analyzer? And will it also be searchable against the multiple Hunspell dictionaries and synonyms configured?

Please confirm whether the settings below can be used to achieve this:

  analyzer :
    synonym :
      tokenizer : whitespace
      filter : [synonym]
    default_index :
      type : custom
      tokenizer : whitespace
      filter : [standard, lowercase, hunspell_US, hunspell_IN, hindi, english]
    default_search :
      type : custom
      tokenizer : whitespace
      filter : [standard, lowercase, synonym, hunspell_US, hunspell_IN, hindi, english]
  filter :
    hindi :
      tokenizer : standard
      filter : [lowercase]
    english :
      tokenizer : standard
      filter : [lowercase]
    synonym :
      type : synonym
      ignore_case : true
      expand : true
      synonyms_path : synonyms.txt
    hunspell_US :
      type : hunspell
      locale : en_US
      dedup : false
      ignore_case : true
    hunspell_IN :
      type : hunspell
      locale : hi_IN
      dedup : false
      ignore_case : true

Say I have then configured the langdetect plugin and indexed some data in different languages, English and Hindi. Since I have configured multiple language analyzers and multilingual Hunspell dictionaries, will I be able to index and search per language, each with its own analyzer, and get results based on the analyzed tokens for each language?

Also, will synonyms work with different languages?

~Prashant

Hi Jorg,

Can you help me out with this? I found that you are the owner of the langdetect plugin.

~Prashant

With the langdetect plugin, there is just a field "lang" mapped under the string field that is used for detection, and the detected language codes are written into this field. This is useful e.g. for aggregations or for filtering documents by language.
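For example, the detected codes can be bucketed with a terms aggregation (a sketch; `my_index` and the `content` field with its `lang` subfield are assumed names, not from this thread):

```
GET /my_index/_search
{
  "size" : 0,
  "aggs" : {
    "languages" : {
      "terms" : { "field" : "content.lang" }
    }
  }
}
```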

At the moment it is not possible to use something like a dynamic "copy_to" to duplicate the field after detection into a field with a language-specific analyzer, such as a synonym analyzer.

A feature request in the issue tracker on GitHub is much appreciated, so I can have a look into this.

Jörg

On Wed, Sep 24, 2014 at 12:57 PM, Prashant Agrawal <
prashant.agrawal@paladion.net> wrote:


--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Use-case-of-multiple-Language-Analyzer-Hunspell-along-with-Elasticsearch-Langdetect-Plugin-tp4063950.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1411556233114-4063950.post%40n3.nabble.com
.
For more options, visit https://groups.google.com/d/optout.


The issue tracker address:

Jörg

On Wed, Sep 24, 2014 at 5:55 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:


You can use the langdetect plugin to identify the language of the document, and use that document path to set `_analyzer`. `_analyzer` can be set dynamically that way, so for every language that may be detected, an analyzer with that name must already exist in the system.

"my_index" : {
"_analyzer" : {
"path" : "lang_detect_field.lang"
},
"properties" : {
"lang_detect_field" : {
"type" : "langdetect",
"fields" : {
"lang_detect" : {
"type" : "string"
},
"lang" : {
"type" : "string"
}
}
}
}
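With a mapping like this, indexing might look like the following (a sketch; the type name, document ID, and text are illustrative):

```
PUT /my_index/doc/1
{
  "lang_detect_field" : "This is an English sentence."
}
```

The plugin then writes the detected language code (e.g. "en") into `lang_detect_field.lang`, which the `_analyzer` path above resolves to the analyzer of that name.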

On Wednesday, 24 September 2014 16:27:23 UTC+5:30, Prashy wrote:


Wow, this works? Surprise....

Jörg

On Thu, Sep 25, 2014 at 10:20 AM, Nitin Maheshwari ask4nitin@gmail.com
wrote:


Yes, it works... :-)

But I am dealing with a different problem. I have a data set of around 200,000 records, and it detected the wrong language for about 8,000 of them, maybe because the text was short. In most wrong cases of language detection it returned "af".

Indexing a document fails if no analyzer is defined in the system for the detected language. I am trying to find out how I can fall back to a default analyzer when the detected analyzer does not exist.
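One possible workaround (a sketch, untested against this setup): register a harmless analyzer under each language code that keeps showing up as a misdetection, so the `_analyzer` path always resolves to something. For "af" that could be as simple as adding to the index settings:

```
analysis :
  analyzer :
    af :
      type : standard
```

This assumes analyzers are looked up purely by name; documents misdetected as "af" would then at least be indexed with the standard analyzer instead of failing.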

On Thu, Sep 25, 2014 at 1:53 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:


--
Nitin (Nits)
http://nitinmaheshwari.in


Sure. In the next release I will add new parameters, so there will be better control over:

  • the languages that can be detected (default: all)

  • if more than one language is detected, how many language codes are
    indexed

  • threshold levels for successful detection

  • and how many words must be in a field before detection is executed (or
    the field length in characters). The number of words should be at least 3,
    otherwise detection is close to random.

Jörg

On Thu, Sep 25, 2014 at 10:36 AM, Nitin Maheshwari ask4nitin@gmail.com
wrote:


Hi Jörg/Nitin,

If I am not wrong, you are suggesting that I set the analyzer dynamically, which I have set up manually like below:
  analyzer :
    synonym :
        tokenizer : whitespace
        filter : [synonym]
    default_index :
        _analyzer :
            path : lang_detect_field.lang
        tokenizer : whitespace
        filter : [standard, lowercase, hunspell_US, hunspell_IN]
    default_search :
        _analyzer :
            path : lang_detect_field.lang
        tokenizer : whitespace
        filter : [standard, lowercase, synonym, hunspell_US, hunspell_IN]
  filter :
    synonym :
        type : synonym
        ignore_case : true
        expand : true
        synonyms_path : synonyms.txt
    hunspell_US :
        type : hunspell
        locale : en_US
        dedup : false
        ignore_case : true
    hunspell_IN :
        type : hunspell
        locale : hi_IN
        dedup : false
        ignore_case : true

Considering I have lang_detect_field in my mapping.

Correct me if I am wrong.

  1. What if my content is of attachment type, can I still use the same approach? The content for indexing will be sent in base64-encoded format. If yes, how can we configure that, given that the field type will be attachment?

  2. If we cannot achieve this using the langdetect plugin, can we still have multiple language analyzers in our config as mentioned in the first post, and will ES be able to recognise them and perform the indexing accordingly?

  3. @Jörg, as you are the owner of the hunspell plugin as well, can you let me know whether I can have multiple language hunspell dictionaries configured in my index settings?

~Prashant

You cannot use "_analyzer", which is a root mapping property, in an
analyzer definition.

My Hunspell plugin is kind of stalled, since there is hunspell support in
the core code:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/analysis-hunspell-tokenfilter.html

To be honest, it was quite a while ago that I was busy with hunspell, so I
cannot answer your question instantly. I remember the results of hunspell
dictionaries used for stemming were not satisfying, partly due to a
poor hunspell dictionary reader. I hope the ES core code works better.

Because hunspell stemming is a token filter, you'd have to create a bunch
of custom analyzers with a hunspell token filter per language, and address
them via the "_analyzer" path method as shown above.
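
For illustration, such a setup might look roughly like the following (a sketch only, assuming ES 1.x; the analyzer names "en" and "hi" are hypothetical and would have to match the language codes that end up in "lang_detect_field.lang", since the value found at the "_analyzer" path is used as the analyzer name at index time):

  index :
    analysis :
      analyzer :
        en :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, hunspell_US]
        hi :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, hunspell_IN]

with "_analyzer" declared at the root of the type mapping, not inside the analysis settings (the "langdetect" field type comes from the langdetect plugin, so verify it against the plugin README for your version):

  {
    "doc" : {
      "_analyzer" : { "path" : "lang_detect_field.lang" },
      "properties" : {
        "lang_detect_field" : { "type" : "langdetect" },
        "content" : { "type" : "string" }
      }
    }
  }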

Jörg

On Thu, Sep 25, 2014 at 11:11 AM, Prashant Agrawal <
prashant.agrawal@paladion.net> wrote:


Hi Jorg,

What about these questions:

  1. What if my content is of attachment type, can I still use the same approach? The content for indexing will be sent in base64-encoded format. If yes, how can we configure that, given that the field type will be attachment?

  2. If we cannot achieve this using the langdetect plugin, can we still have multiple language analyzers in our config as mentioned in the first post, and will ES be able to recognise them and perform the indexing accordingly?

  3. Regarding "I hope the ES core code works better": what do you mean by ES core code here? Are there any specific settings that can be used for grammar-based search?

Also, I am not getting a clear picture (it got mixed up somewhere with _analyzer) of how we can create multiple analyzers for different languages on the same index. It would be great if you could give me a small demonstration that overrides the analyzer settings in my first post (where I used default_index and default_search) to switch between two analyzers dynamically.

~Prashant

Langdetect works with binary content, e.g. from attachment mapper.
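
For what it's worth, a hedged sketch of how that might be wired up (the field names are hypothetical, and whether the langdetect field accepts base64 input directly should be verified against the plugin README for your version):

  {
    "doc" : {
      "properties" : {
        "content" : { "type" : "attachment" },
        "content_lang" : { "type" : "langdetect" }
      }
    }
  }

and then the same base64 payload would be sent to both fields when indexing:

  { "content" : "<base64 data>", "content_lang" : "<base64 data>" }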

The other questions cannot be answered quickly. If I find time, I can post
something at gist.github.com.

Jörg

On Thu, Sep 25, 2014 at 11:41 AM, Prashant Agrawal <
prashant.agrawal@paladion.net> wrote:


Ok, no problem.

Please let me know when you post anything about these queries to gist.github.com.

Also, I hope that by using the langdetect plugin we can analyze content in different languages as well (using _analyzer), so the feature request for "a dynamic "copy_to" to duplicate the field after detection to a field with language-specific analyzer like a synonym analyzer" is not required now, right?