Stemming Capability for English+Arabic Content

Hi,
I am indexing data that is a mixture of Arabic and English content.

The content is very large, so to avoid storage overhead the multi-field
option (one field with the arabic analyzer and a second with the snowball
analyzer) is not preferable.

So, does any analyzer (snowball, arabic, or any other) have the
capability to stem both Arabic and English?

Please help.

Thanks
Tarang Dawer

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,
If I configure a custom analyzer with these filters: Arabic stop words,
English stop words, Arabic stemmer, and English stemmer, will each
language's filters leave the tokens of the other language alone?

Also, will the standard tokenizer work fine in this case? (The arabic
analyzer does not list a tokenizer type in the documentation.)

Could somebody please help me resolve this?

Thanks
Tarang Dawer

Have you tried the following plugin?

https://github.com/yakaz/elasticsearch-analysis-combo

I haven't used it myself, but it looks promising.

I'm no expert on stemming, but as I understand it, combining multiple
stemmers in one analyzer wouldn't be a good idea, as they would, as you
assumed, interfere with each other. You can check whether the output of
such an analyzer is any good by defining one and testing it via the REST
interface.
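
A minimal sketch of that check, assuming a node at http://localhost:9200 and an index named mixed_content (both hypothetical). It only builds the request for the _analyze endpoint; the lines that would send it are left commented out. The JSON body form shown here is the one used by recent Elasticsearch versions; older releases from the era of this thread took the analyzer as a URL parameter instead, so check the documentation for your version.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and index; "arabic" is the built-in language analyzer.
request = build_analyze_request(
    "http://localhost:9200", "mixed_content", "arabic", "الكتب books"
)
print(request.full_url)
print(request.data.decode("utf-8"))

# To run it against a live node, uncomment:
# with urllib.request.urlopen(request) as response:
#     for token in json.load(response)["tokens"]:
#         print(token["token"])
```

Running the same text through each candidate analyzer and comparing the emitted tokens shows quickly whether one language's filters damage the other's words.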

If you mean the language analyzer for Arabic: as stated in the
documentation, it is already optimized for Arabic, so you can't define
the tokenizer. If you create your own custom analyzer with the Arabic stemmer,
you can provide your own tokenizer, but you have to know which one to
use to get the best results.

hth
david

And a stemmer is for one language only.

hth
david

Hi,
Thanks for your reply, David.

The arabic analyzer also uses the standard tokenizer, so wouldn't
configuring a custom analyzer with the standard tokenizer and the
following filters be enough?

    1. English stemmer
    2. English stop words
    3. Arabic stemmer
    4. Arabic stop words

(My thinking is that an English word would not be found in the Arabic stop
word list, and the Arabic stemmer would not be able to extract a root from
it; the same holds, vice versa, for Arabic words and the English filters.)
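
The proposal above could be sketched as index settings like the following, written as a Python dict so it can be dumped to JSON. The analyzer name mixed_ar_en and the filter labels are hypothetical names chosen for this sketch; the stop, stemmer, lowercase, and arabic_normalization filter types are ones Elasticsearch documents, but verify them against your version. The stop filters are placed before the stemmers here, since stop word lists contain unstemmed forms; that ordering is a choice of this sketch, not something stated in the thread.

```python
import json

# Hypothetical names: "mixed_ar_en" and the filter labels are chosen here.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "arabic_stop": {"type": "stop", "stopwords": "_arabic_"},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "arabic_stemmer": {"type": "stemmer", "language": "arabic"},
            },
            "analyzer": {
                "mixed_ar_en": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "arabic_stop",
                        "arabic_normalization",
                        "english_stemmer",
                        "arabic_stemmer",
                    ],
                }
            },
        }
    }
}

# Print the body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```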

Combining two languages like Arabic and English is doable: the Unicode
code points always differ, so you don't need to worry about one analyzer
messing with the results of the other, and the work of the two analyzers is
pretty much orthogonal. For any other pair of languages I would say don't do
that, but English and Arabic should work well.

And don't use stop words.
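
A small sketch of the code point argument: Arabic letters and Latin letters sit in separate Unicode ranges, so a token can be attributed to one script by inspecting its characters. The function name script_of is hypothetical, and the check covers only the basic Arabic block (U+0600 to U+06FF), not the supplementary or presentation-form blocks. Digits and punctuation are shared between the two languages, so the separation applies to letters only.

```python
def script_of(token):
    """Classify a token as 'arabic', 'latin', or 'other' by code point."""
    # Basic Arabic block only; an assumption that is enough for this sketch.
    if any("\u0600" <= ch <= "\u06FF" for ch in token):
        return "arabic"
    if any("a" <= ch.lower() <= "z" for ch in token):
        return "latin"
    return "other"


tokens = "indexing الكتب and المكتبة books 2013".split()
for token in tokens:
    print(token, script_of(token))
```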

Thanks for your reply, Itamar.

Very helpful, certainly.
But why not use a stop words filter?
(Does that apply to Arabic only, or to both English and Arabic?)

Thanks
Tarang Dawer

For both English and Arabic, because today there are better ways to deal
with the problems stop words pose, for example common grams or the
common terms query.
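
As a hedged illustration of those two alternatives, the sketch below builds a common_grams token filter definition and a common terms query body as Python dicts. The filter label, analyzer name, field name body, word list, and cutoff value are all hypothetical choices for this example. Note that the common terms query existed in Elasticsearch releases of this thread's era but, as far as I know, was deprecated and later removed in newer versions, so check the documentation for the version you run.

```python
import json

# Index-time alternative: keep frequent words but also emit bigrams with them.
common_grams_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "frequent_word_grams": {
                    "type": "common_grams",
                    # Hypothetical short list mixing English and Arabic words.
                    "common_words": ["the", "of", "and", "في", "من"],
                }
            },
            "analyzer": {
                "mixed_common_grams": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "frequent_word_grams"],
                }
            },
        }
    }
}

# Query-time alternative: treat very frequent terms as less important.
common_terms_query = {
    "query": {
        "common": {
            "body": {"query": "the history of الكتب", "cutoff_frequency": 0.001}
        }
    }
}

print(json.dumps(common_grams_settings, indent=2, ensure_ascii=False))
print(json.dumps(common_terms_query, indent=2, ensure_ascii=False))
```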

OK, got it.
Thanks again for your response.

Regards
Tarang Dawer
