Char_filter not applied to a query

I am trying to have several multi-word expressions (MWEs) indexed and
searchable completely separately from the words they are made of. For
example, a query for "red shift" should never return documents that mention
"red" or "shift" separately, and vice versa.

To protect those words from the tokenizer, I applied a mapping char_filter
that joins them with triple underscores (e.g. "red___shift"). This works for
indexing, and I can see separate indexed entries for the MWEs. However, they
are not searchable unless I put the triple underscore in manually. In other
words, a query of "red___shift" produces the results needed, but a query of
"red shift" does not.
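
For readers who cannot open the gist, the setup described above might look
roughly like the following. This is only a sketch: the names "mwe_join" and
"mwe_analyzer" are made up for illustration, and the choice of the standard
tokenizer and lowercase filter is an assumption, not taken from the gist.

```python
import json

# Hypothetical index settings: a mapping char_filter that rewrites the MWE
# before the tokenizer runs, so "red shift" reaches it as one word.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "mwe_join": {  # illustrative name
                    "type": "mapping",
                    "mappings": ["red shift=>red___shift"],
                }
            },
            "analyzer": {
                "mwe_analyzer": {  # illustrative name
                    "type": "custom",
                    "char_filter": ["mwe_join"],
                    "tokenizer": "standard",  # assumed tokenizer
                    "filter": ["lowercase"],  # assumed filter
                }
            },
        }
    }
}

# The JSON body that would be sent when creating the index.
settings_json = json.dumps(index_settings, indent=2)
print(settings_json)
```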

Here is a cURL reproduction:
https://gist.github.com/koterpillar/7b11639d4b0a9e38726c

Am I completely wrong in my approach to MWEs?

If not, how can I make char_filter apply to the query so that it matches
the indexed MWE?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I have since found out:

  • "query_string" splits my query into separate words ("red" and "shift")
    before analysis, so the char_filter never sees the whole phrase. I can
    quote the whole query ("query": "\"red shift\""), but this breaks
    legitimate multi-word queries.
  • The "match" query works as expected.

The only remaining question is: is there a less hacky way of "escaping"
MWEs?
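
The difference between the two query types can be illustrated with a toy
model in plain Python. It does not call Elasticsearch; the functions below
are simplified stand-ins invented for this sketch, and they only mimic the
order of operations described in the two points above.

```python
import re

# Toy stand-in for the mapping char_filter (assumed single rule).
MWE_MAPPINGS = {"red shift": "red___shift"}


def char_filter(text):
    """Rewrite each known MWE in the raw text before tokenizing."""
    for source, target in MWE_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Split into lowercase word tokens; \\w keeps underscores together."""
    return re.findall(r"\w+", text.lower())


def analyze(text):
    """Char filter first, then tokenizer, as in an analyzer chain."""
    return tokenize(char_filter(text))


def match_terms(query):
    """Model of 'match': the whole query string is analyzed at once."""
    return analyze(query)


def query_string_terms(query):
    """Model of 'query_string': split on whitespace, then analyze each word."""
    terms = []
    for word in query.split():
        terms.extend(analyze(word))
    return terms


print(analyze("a red shift was observed"))
print(match_terms("red shift"))
print(query_string_terms("red shift"))
```

In this model the char filter only helps when it receives "red shift" as one
string, which is the case for the "match" path but not for the
"query_string" path.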

It appears that you are trying to match a phrase. Which works very well,
but you need the correct query!

Brian

Looked at match_phrase. Yes, that would be what I need, except I would like
ES to pick phrases out of my query. For example, "red shift time" should
match, independently, on "red shift" and "time", but not on "red time
shift".
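
One client-side way to approximate this is to pick the known phrases out of
the query before sending it, then use match_phrase for the phrases and match
for the remaining words. The sketch below is an assumption-laden
illustration: the MWE list, the field name "text", and the use of "should"
clauses are all hypothetical choices, not something confirmed in this thread.

```python
# Hypothetical list of known multi-word expressions, as word tuples.
KNOWN_MWES = [("red", "shift")]


def split_query(query, mwes=KNOWN_MWES):
    """Split a query into parts, keeping known MWEs together when the words
    appear adjacent and in order."""
    words = query.lower().split()
    parts = []
    i = 0
    while i < len(words):
        for mwe in mwes:
            if tuple(words[i:i + len(mwe)]) == mwe:
                parts.append(" ".join(mwe))
                i += len(mwe)
                break
        else:
            # No MWE starts here, so keep the single word.
            parts.append(words[i])
            i += 1
    return parts


def build_query(query, field="text"):
    """Build a bool query body: match_phrase for MWEs, match for single words."""
    clauses = []
    for part in split_query(query):
        kind = "match_phrase" if " " in part else "match"
        clauses.append({kind: {field: part}})
    return {"query": {"bool": {"should": clauses}}}


print(split_query("red shift time"))
print(split_query("red time shift"))
print(build_query("red shift time"))
```

Note that this alone does not stop a lone "red" from matching documents that
contain "red shift"; it only controls how the query is broken up.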

2013/6/5 InquiringMind brian.from.fl@gmail.com

It appears that you are trying to match a phrase. Which works very well,
but you need the correct query!

Brian

To add some complexity, how do I run the tokenizer on each word but still
only match the phrase? For example, if I specify "read write", it should be
found for a query like "reading writing".

Hi Alexey

I would recommend that you look into a stemming token filter (try looking
at the snowball analyzer).

As for match_phrase, the phrases "red shift" and "time" will each match the
indexed item "red shift time", but the phrase "red time shift" will not.

Regards
Tarang Dawer

On Wed, Jun 5, 2013 at 4:18 AM, Alexey Kotlyarov koterpillar@gmail.com wrote:

To add some complexity, how do I run the tokenizer on each word but
still only match the phrase? For example, if I specify "read write", it
should be found for a query like "reading writing".

I ended up removing the char_filter altogether and using a synonym filter
placed after the snowball one. The only problem left: if a part of an MWE
(e.g. "respite") is stemmed, I have to put the stemmed result ("respit")
into the synonym mapping.
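
A minimal sketch of what such settings might look like is below. The filter
and analyzer names are invented, the tokenizer and the lowercase filter are
assumptions, and "respite care" is a hypothetical MWE used only to show the
stemmed form; only "red shift" and the "respite" to "respit" stemming come
from this thread.

```python
import json

# Hypothetical settings: the synonym filter runs after the snowball stemmer,
# so every rule has to be written in terms of already-stemmed tokens.
synonym_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_snowball": {  # illustrative name
                    "type": "snowball",
                    "language": "English",
                },
                "mwe_synonyms": {  # illustrative name
                    "type": "synonym",
                    "synonyms": [
                        "red shift => red___shift",
                        # Hypothetical MWE "respite care": "respit" is the
                        # stemmed form; "care" is assumed to stay unchanged.
                        "respit care => respit___care",
                    ],
                },
            },
            "analyzer": {
                "mwe_analyzer": {  # illustrative name
                    "type": "custom",
                    "tokenizer": "standard",  # assumed tokenizer
                    "filter": ["lowercase", "english_snowball", "mwe_synonyms"],
                }
            },
        }
    }
}

synonym_settings_json = json.dumps(synonym_settings, indent=2)
print(synonym_settings_json)
```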

Does it help to place the synonyms filter before the snowball filter,
instead of after?

On Wednesday, June 5, 2013 1:10:08 AM UTC-4, Alexey Kotlyarov wrote:

I ended up removing the char_filter altogether and using a synonym filter
placed after the snowball one. The only problem left: if a part of an MWE
(e.g. "respite") is stemmed, I have to put the stemmed result ("respit")
into the synonym mapping.
