Stemming Capability for English+Arabic Content

tarang_dawer · June 12, 2013, 7:25am

Hi
I am indexing data that is a mixture of Arabic and English content.

Since the content is huge, I want to avoid storage overhead, so the
multi-field option (one field with the arabic analyzer and a second
with the snowball analyzer) is not preferable.

So, does any analyzer (snowball, arabic, or any other) have the
capability to stem both Arabic and English?

Please Help

Thanks
Tarang Dawer

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

tarang_dawer · June 13, 2013, 6:40am

Hi
If I configure a custom analyzer with these filters: Arabic stop words, English
stop words, Arabic stemmer, and English stemmer, will each language's
filters interfere with tokens of the other language?

Also, will the standard tokenizer work fine in this case? (The arabic analyzer
does not list a tokenizer type in the documentation.)

Could somebody please help me resolve the issue?

Thanks
Tarang Dawer

dp_desoma · June 13, 2013, 1:44pm

Have you tried the following plugin? https://github.com/yakaz/elasticsearch-analysis-combo
I haven't used it myself, but it looks promising.

I'm no expert on stemming, but as I understand it, combining multiple
stemmers in one analyzer wouldn't be a good idea, as they would, as you
assumed, interfere with each other. You can check whether the analyzer's
output is any good by defining one and calling it via the REST interface.
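
One way to run that check from a script is the `_analyze` endpoint. The sketch below is only an illustration: the index name `mixed_content` and the `localhost:9200` address are assumptions, and it uses the JSON request body accepted by recent Elasticsearch versions (older releases took the text and analyzer differently). `build_analyze_request` only builds the request, and `analyze` sends it to a running node.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer, index="mixed_content",
                          base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API.

    The index name and base URL are placeholders, not values from this thread.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer, **kwargs):
    """Send the request to a running node and return the token strings."""
    request = build_analyze_request(text, analyzer, **kwargs)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

Calling `analyze("running كتابها", "arabic")` against a live node would list the tokens the analyzer actually emits, which makes it easy to see whether both languages come out stemmed.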

If you mean the language analyzer for Arabic, this one is, as stated in the
documentation, already optimized for the Arabic language, so you can't define
the tokenizer. If you create your own custom analyzer with the Arabic stemmer,
you can provide your own tokenizer, but you have to know which one to
use to get the best results.

hth
david

dp_desoma · June 13, 2013, 1:46pm

Have you tried the following plugin? https://github.com/yakaz/elasticsearch-analysis-combo
I haven't used it myself, but it looks promising.

I'm no expert on stemming, but as I understand it, combining multiple
stemmers in one analyzer wouldn't be a good idea, as they would, as you
assumed, interfere with each other. You can check whether the analyzer's
output is any good by defining one and calling it via the REST interface.

If you mean the language analyzer for Arabic, this one is, as stated in the
documentation, already optimized for the Arabic language, so you can't define
the tokenizer. If you create your own custom analyzer with the Arabic stemmer,
you can provide your own tokenizer, but you have to know which one to
use to get the best results.

Also, a stemmer is for one language only.

hth
david

tarang_dawer · June 17, 2013, 7:15am

Hi
Thanks for your reply, David.

The arabic analyzer also uses the standard tokenizer, so wouldn't configuring
a custom analyzer with the standard tokenizer and the following filters be enough?

1. English stemmer
2. English stop words
3. Arabic stemmer
4. Arabic stop words

(My thinking is that an English word would not be found in the Arabic stop
words list, and the Arabic stemmer would not be able to extract a
root from it; the same holds in reverse for Arabic words and the English filters.)
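
The filter chain described above could be written as index settings roughly like this. It is a minimal sketch, not a tested configuration: the analyzer name `mixed_en_ar` and the four filter names are made up for illustration, the `lowercase` and `arabic_normalization` steps are additions borrowed from how the built-in language analyzers are usually composed, and the stop filters are placed before the stemmers so that stop words are matched in their unstemmed form.

```python
import json

# Hypothetical names: "mixed_en_ar" and the filter keys are illustrative only.
SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "arabic_stop": {"type": "stop", "stopwords": "_arabic_"},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "arabic_stemmer": {"type": "stemmer", "language": "arabic"},
            },
            "analyzer": {
                "mixed_en_ar": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order matters: normalize and drop stop words, then stem.
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "arabic_stop",
                        "arabic_normalization",
                        "english_stemmer",
                        "arabic_stemmer",
                    ],
                }
            },
        }
    }
}


def settings_json():
    """Serialize the settings as the JSON body for an index-creation request."""
    return json.dumps(SETTINGS, indent=2)


if __name__ == "__main__":
    print(settings_json())
```

The printed JSON is what would be sent when creating the index; a single field mapped to `mixed_en_ar` then avoids the second copy of the data that a multi-field would need.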

Itamar_Syn_Hershko · June 17, 2013, 7:19am

Combining two languages like Arabic and English is doable: their Unicode
code points never overlap, so you don't need to worry about one analyzer
messing with the results of the other, and the work of the two analyzers is
pretty much orthogonal. For any other pair of languages I would say don't do
that, but English and Arabic should work well.
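
To illustrate the code point argument, here is a small sketch that classifies a token by script. It is an assumption-laden simplification: it checks only the basic Arabic block (U+0600 to U+06FF) and ASCII letters, and the function name `script_of` is made up. It is not how the Lucene stemmers decide what to touch, only a demonstration that the two alphabets occupy disjoint ranges.

```python
def script_of(token):
    """Classify a token as 'arabic', 'latin', 'mixed', or 'other'.

    Only the basic Arabic block and ASCII letters are considered here.
    """
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    for word in ["running", "كتاب", "2013"]:
        print(word, script_of(word))
```

Because a stemmer's rules are written against the letters of its own alphabet, a token made only of the other script gives it nothing to match, so it passes through unchanged.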

And don't use stop words

tarang_dawer · June 17, 2013, 7:32am

Thanks for your reply, Itamar.

Very helpful, certainly.
But why not use a stop words filter?
(For Arabic only, or for both English and Arabic?)

Thanks
Tarang Dawer

Itamar_Syn_Hershko · June 17, 2013, 7:33am

For both English and Arabic, because today there are better ways to deal
with the problems stop words pose, for example by using common grams or
the common terms query.
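
For readers unfamiliar with common grams, the idea is to keep frequent words in the index but also emit a bigram whenever a frequent word sits next to another token, so phrase queries involving those words stay precise. The sketch below is a simplified emulation written for this explanation, not Lucene's implementation; the function name and the `_` separator are assumptions.

```python
def common_grams(tokens, common_words, sep="_"):
    """Emit every token, plus a bigram wherever either neighbour is common."""
    common = {word.lower() for word in common_words}
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            # A bigram is added only when a common word is involved.
            if token.lower() in common or following.lower() in common:
                output.append(token + sep + following)
    return output


if __name__ == "__main__":
    print(common_grams("the quick fox is brown".split(), ["is", "the"]))
```

With `is` and `the` as the common words, the rare pair `quick fox` produces no bigram, while `the_quick`, `fox_is`, and `is_brown` are added alongside the original tokens.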

tarang_dawer · June 17, 2013, 7:39am

OK, got it.
Thanks again for your response.

Regards
Tarang Dawer
