We have an ES cluster that indexes a large amount of data in several languages. At the moment our settings point to the English analyzer and an English hunspell dictionary. How can we index multilingual data, with a multilingual analyzer and hunspell setup, in the same index? (I came across the Elasticsearch Langdetect Plugin (https://github.com/jprante/elasticsearch-langdetect), available from ES 1.2.1.)
Can we configure a multilingual analyzer and hunspell for the same index, and then index data with the langdetect plugin configured for specific fields? In that case, will the data be indexed and analyzed by the appropriate language analyzer, and will it be searchable through the multiple hunspell dictionaries and synonyms configured?
Please confirm whether the settings below can be used to achieve this.
Say I have configured the langdetect plugin and indexed some data in different languages, English and Hindi. Since I have configured multiple language analyzers and multilingual hunspell, will I be able to index and search per language with the corresponding analyzer, and get results based on the tokens analyzed for each language?
Also, will synonyms work across different languages?
With the langdetect plugin, there is just a field "lang" mapped under the string field that is used for detection, and the detected language codes are written into this field. This is useful, for example, for aggregations or for filtering documents by language.
At the moment it is not possible to use something like a dynamic "copy_to" to duplicate the field after detection into a field with a language-specific analyzer, such as a synonym analyzer.
A feature request on the issue tracker at GitHub would be much appreciated, so I can have a look into this.
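For context, a minimal sketch of how such a detected-language sub-field could be used for aggregation. The index and field names here (`myindex`, `content.lang`) are hypothetical; check the plugin README for the exact mapping it produces:

```json
POST /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_language": {
      "terms": { "field": "content.lang" }
    }
  }
}
```

A terms aggregation like this returns document counts per detected language code; a term filter on the same field would restrict a search to one language.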
You can use the langdetect plugin to identify the language of a document, and use that document path to set _analyzer. _analyzer can be set dynamically that way, so analyzers named after the detected languages must exist in the system.
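As a sketch, an ES 1.x mapping that picks the analyzer from a field value via the root `_analyzer` property could look like this (the index, type, and field names are hypothetical, and this mechanism was removed in later ES versions):

```json
PUT /myindex
{
  "mappings": {
    "doc": {
      "_analyzer": { "path": "content.lang" },
      "properties": {
        "content": { "type": "string" }
      }
    }
  }
}
```

With this mapping, a document whose `content.lang` value is `en` is analyzed with an analyzer named `en`, so an analyzer must be defined in the index settings for every language code that can appear.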
On Wednesday, 24 September 2014 16:27:23 UTC+5:30, Prashy wrote:
On Thu, Sep 25, 2014 at 10:20 AM, Nitin Maheshwari ask4nitin@gmail.com
wrote:
Yes, it works. But I am dealing with a different problem: I have a data set of around 200,000 records, and it detected the wrong language for about 8,000 of them, possibly because the text was short. In most of the wrong cases the detection returned "af". Indexing a document fails if no analyzer is defined in the system for the detected language. I am trying to find out how to fall back to a default analyzer when the detected analyzer does not exist.
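One workaround sketch for the missing-analyzer failure (analyzer names here are assumptions, and `hunspell_US` stands for a hunspell token filter defined elsewhere in the index settings): define an analyzer for every language code langdetect can emit, and point codes you do not care about at a plain analyzer:

```yaml
index:
  analysis:
    analyzer:
      en:
        type: custom
        tokenizer: standard
        filter: [lowercase, hunspell_US]
      # "af" is frequently misdetected for short texts;
      # alias it to the standard analyzer as a fallback
      af:
        type: standard
```

As far as I know, ES 1.x has no built-in "fall back to the default analyzer if the named one is missing" option, so covering all emittable codes explicitly is the safer route.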
Sure. In the next release I will add new parameters, so there will be better control over:
- the languages that can be detected (default: all)
- if more than one language is detected, how many language codes are indexed
- threshold levels for successful detection
- how many words (or characters) a field must contain before detection is executed; the number of words should be at least 3, otherwise detection is close to random.
Jörg
If I am not wrong, you are suggesting that I set the analyzer dynamically, which I have set up manually like below:

analyzer:
  synonym:
    tokenizer: whitespace
    filter: [synonym]
  default_index:
    _analyzer:
      path: lang_detect_field.lang
    tokenizer: whitespace
    filter: [standard, lowercase, hunspell_US, hunspell_IN]
  default_search:
    _analyzer:
      path: lang_detect_field.lang
    tokenizer: whitespace
    filter: [standard, lowercase, synonym, hunspell_US, hunspell_IN]
filter:
  synonym:
    type: synonym
    ignore_case: true
    expand: true
    synonyms_path: synonyms.txt
  hunspell_US:
    type: hunspell
    locale: en_US
    dedup: false
    ignore_case: true
  hunspell_IN:
    type: hunspell
    locale: hi_IN
    dedup: false
    ignore_case: true

This assumes I have lang_detect_field in my mapping. Correct me if I am wrong.
Also, what if my content is of attachment type: can I still use the same approach, given that the content will be sent for indexing in base64-encoded form? If yes, how can we configure that, since the field type will be attachment?
If we cannot achieve this with the langdetect plugin, can we still have multiple language analyzers in our config as mentioned in the first post, and will ES be able to recognize them and perform the indexing?
@Jörg, as you are the owner of the hunspell plugin as well, can you let me know whether I can configure hunspell for multiple languages in my index settings?
You cannot use "_analyzer", which is a root mapping property, in an analyzer definition.
My hunspell plugin is more or less stalled, since there is hunspell support in the core code.
To be honest, it was quite a while ago that I was busy with hunspell, so I cannot answer your question instantly. I remember that the results of hunspell dictionaries used for stemming were not satisfying. This was also due to a poor hunspell dictionary reader. I hope the ES core code works better.
Because hunspell stemming is a token filter, you would have to create a set of custom analyzers, each with a hunspell token filter for one language, and address them via the "_analyzer" path method as shown above.
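Put together, a sketch of that approach (the names `en`, `hi`, `hunspell_en`, and `hunspell_hi` are assumptions, and the hunspell dictionaries must be available for the given locales): one custom analyzer per language code, each with its own hunspell token filter:

```yaml
index:
  analysis:
    filter:
      hunspell_en:
        type: hunspell
        locale: en_US
      hunspell_hi:
        type: hunspell
        locale: hi_IN
    analyzer:
      en:
        type: custom
        tokenizer: standard
        filter: [lowercase, hunspell_en]
      hi:
        type: custom
        tokenizer: standard
        filter: [lowercase, hunspell_hi]
```

The mapping would then set the root `_analyzer` property with `path` pointing at the detected-language field (e.g. `content.lang`), so a document detected as `hi` is analyzed by the `hi` analyzer. With core hunspell support, the dictionary files are expected under the node's config directory, e.g. `config/hunspell/en_US/`.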
"I hope the ES core code works better."
What do you mean by ES core code here? Are there any specific settings that can be used for grammar-based search?
Also, I do not yet have a clear picture (it got mixed up somewhere with _analyzer) of how we can create multiple (different-language) analyzers for the same index. It would be great if you could give a small demonstration of how to rewrite the analyzer settings in my first post (where I used default_index and default_search) so that the two analyzers are chosen dynamically.
So could you post an example for this query as a gist on gist.github.com?
Also, I hope that by using the langdetect plugin we can analyze content in different languages as well (using _analyzer), so the feature request for "a dynamic "copy_to" to duplicate the field after detection to a field with language-specific analyzer like a synonym analyzer" is not required now, right?