Can I exclude certain fields from being stemmed when rolled into a stemmed composite field via multifield? Can Keyword Filter / Stemmer Override be used?

Hello,

I would like to prevent certain fields that are fed into "_all", or into some
custom composite multifield that has stemming enabled, from being stemmed.

I do not know exactly how multifield and _all work. Do the tokens generated
for the original fields get duplicated and channeled to _all or to the
multifields? If so, is there any way to inject a token filter before they are
channeled to _all/multifield, and could that filter mark the tokens as
keywords so they do not get stemmed when added to _all?

What is the difference between the two filters, Keyword Filter and Stemmer
Override? They seem to do the same thing: protect explicitly listed words
from stemming by applying the keyword flag.

Thank you,
Alex

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Bump - Any suggestions?


The keyword analyzer takes its input as it is (think of an "input stream
identity analyzer").

You can easily set up a multifield to apply other analyzers to the same
input of a field. Multifield branches the input to many fields. Each field
carries its own analyzer, so tokens are generated separately for each field.
Note that _all is also a field whose analyzer can be defined. But you will
get weird results with the keyword analyzer on _all, because _all is a
concatenation of many fields: you will notice that querying such a field
set to the keyword analyzer is close to impossible.
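
To make the branching concrete, here is a minimal sketch that builds a
multifield mapping as a Python dict and prints it as JSON. It assumes the
"multi_field" mapping type used by Elasticsearch releases of this period; the
helper name, the sub-field names "stemmed" and "raw", and the choice of the
"english" and "keyword" analyzers are illustrative assumptions, not taken
from the thread.

```python
import json


def multi_field_mapping(name, analyzers):
    """Build a multi_field mapping: every sub-field receives the same raw
    input and analyzes it with its own analyzer."""
    # The default sub-field (same name as the field) uses the default analyzer.
    fields = {name: {"type": "string"}}
    for sub_name, analyzer in analyzers.items():
        fields[sub_name] = {"type": "string", "analyzer": analyzer}
    return {name: {"type": "multi_field", "fields": fields}}


# Hypothetical sub-fields: one stemmed, one kept verbatim.
mapping = multi_field_mapping(
    "companyName", {"stemmed": "english", "raw": "keyword"}
)
print(json.dumps(mapping, indent=2))
```

Each entry under "fields" is analyzed independently, which is why a flag set
on a token in one sub-field's chain is not visible to another sub-field.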

Note that there is also word separation and other algorithms in an analyzer.

I hope this helps. Just try multifield with ES to find out how it works.
There is much power behind it, in combination with "analyzer" and "index".

Jörg


Thanks Jörg,

I use multifields extensively. I think I did not express my question
clearly.

Say I have an object Memo with two fields, companyName and text. I want a
multifield called memo_search that combines both fields and is stemmed while
retaining the original terms, with a filter chain like this:
["standard", "lowercase", "stop", "stem_possessive_english",
"keyword_repeat", "kstem", "stem_deduplicate"]

But I would like companyName to stay unstemmed in the memo_search field, by
giving its tokens a keyword flag so that the stemmer of memo_search ignores
them. (Stemmers do not modify any token marked as a keyword; Keyword Filter
and Stemmer Override use this strategy to protect predefined tokens from
stemming.)
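
The keyword-flag mechanism described above can be sketched with a toy model
in Python. This is not Lucene or Elasticsearch code: the Token class, the
function names, and the suffix stripper (a crude stand-in, not kstem) are all
assumptions made up for illustration. It only shows the idea that a stemmer
skips tokens flagged as keywords, and that a keyword_repeat-style step keeps
the original term next to its stem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False


def tokenize(text):
    """Split on whitespace and lowercase (stand-in for tokenizer + lowercase)."""
    return [Token(w.lower(), i) for i, w in enumerate(text.split())]


def keyword_repeat(tokens):
    """Emit each token twice: a keyword-flagged copy, then a stemmable copy."""
    out = []
    for t in tokens:
        out.append(Token(t.text, t.position, keyword=True))
        out.append(Token(t.text, t.position, keyword=False))
    return out


def mark_all_keywords(tokens):
    """Flag every token as a keyword so the stemmer leaves it alone."""
    return [Token(t.text, t.position, keyword=True) for t in tokens]


def toy_stem(tokens):
    """Toy suffix stripper (NOT kstem); skips tokens flagged as keywords."""
    out = []
    for t in tokens:
        text = t.text
        if not t.keyword:
            for suffix in ("ning", "ing", "s"):
                if text.endswith(suffix) and len(text) - len(suffix) >= 3:
                    text = text[: -len(suffix)]
                    break
        out.append(Token(text, t.position, t.keyword))
    return out


def dedupe_same_position(tokens):
    """Drop tokens whose text repeats at the same position."""
    seen, out = set(), []
    for t in tokens:
        key = (t.position, t.text)
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


def analyze(text):
    chain = dedupe_same_position(toy_stem(keyword_repeat(tokenize(text))))
    return [t.text for t in chain]


print(analyze("running shoes"))  # ['running', 'run', 'shoes', 'shoe']
protected = toy_stem(mark_all_keywords(tokenize("Running Systems")))
print([t.text for t in protected])  # ['running', 'systems']
```

The second call is what the companyName case would need: every token flagged
before the stemmer runs, so none of them is modified.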

So my questions are:

  1. When I define a multifield mapping, will ES take the value of the field
    and add it to my multifield, or will it copy the tokens produced while
    processing the primary field and send them to the secondary multifield(s)?
  2. If the answer to #1 is tokens, is there any way I can mark all tokens as
    keywords so the stemmer in memo_search will pass them through unchanged?
  3. If not, is there any way to specify additional filters on a secondary
    multifield, applied before the filters defined for the multifield itself
    (memo_search), so I can mark the tokens as keywords?

Now I know you will say: why not collect my memo fields in two multifields,
one stemmed and one not, with the additional benefit of boosting the
unstemmed field in searches? Yes, that would be great if it were just a
handful of fields, but I have a deeply nested JSON with a couple of hundred
fields, and besides stemming I want shingles, so the mappings grow like a
snowball and become very error-prone to change. That is why I am exploring
the alternative option.

This option of course has one potential issue: I will be running a stemmed
query over a mix of stemmed and unstemmed text, which could give me some
false positives on the unstemmed data. For example, when searching for
"running", the query is analyzed to both "run" and "running" (I retain
original values in my analyzer), and I get a match on the company name "Run
for life Inc". That is undesired, but I would have the same problem with two
separate stemmed/unstemmed fields, except that with two fields I could apply
a custom boost.
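
The false positive described above can be reproduced with a small toy model.
The sketch below is plain Python, not Elasticsearch: toy_stem is a crude
suffix stripper standing in for the real stemmer, the sentence "She was
running late" is an invented sample value for the text field, and term
matching is reduced to set intersection.

```python
def toy_stem(word):
    """Crude suffix stripper used only for illustration (NOT kstem)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text, stem):
    """Lowercased terms; with stem=True the stems are added to the originals."""
    words = text.lower().split()
    out = set(words)
    if stem:
        out |= {toy_stem(w) for w in words}
    return out


company_terms = terms("Run for life Inc", stem=False)  # kept unstemmed
text_terms = terms("She was running late", stem=True)  # stemmed + originals
query_terms = terms("running", stem=True)              # "running" and "run"

print(sorted(query_terms & company_terms))  # ['run']
print(sorted(query_terms & text_terms))     # ['run', 'running']
```

The query stem "run" hits the unstemmed company name even though the user
typed "running", which is the undesired match in question.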
