Which analyzing strategy for combinations of case-sensitivity, wildcards and phrase search?

Stefan_Pi_3 · August 6, 2013, 6:36am

I wonder what the best strategy is for analyzing/indexing a text if I want
to support these search options (and any combination of them) on the index later:

Search for case-sensitive terms as well as case-insensitive terms
Search for wildcard terms as well as exact terms
Search for phrase terms

What I've read so far is that I usually need to create a new field with a
dedicated analyzer for every search option. E.g., if I want case-sensitive
and case-insensitive search, then I have to index the text twice: once
in a case-sensitive analyzed field and once in a case-insensitive
analyzed field. The same approach goes for wildcards (realized through n-grams)
and exact matches (without n-grams). Is this correct so far?
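
To illustrate the n-gram idea mentioned above, here is a minimal sketch in plain Python (not Elasticsearch code). It assumes edge n-grams, i.e. only the leading substrings of each token, so that a trailing-wildcard search like term* becomes an exact lookup of "term". The function names and the min_gram/max_gram defaults are made up for this illustration.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the leading substrings (edge n-grams) of a token.

    Tokens shorter than min_gram produce no n-grams at all.
    """
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def build_index(docs, min_gram=2, max_gram=10):
    """Map every edge n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # naive whitespace tokenizer, case kept as-is
            for gram in edge_ngrams(token, min_gram, max_gram):
                index[gram].add(doc_id)
    return index


def prefix_search(index, prefix):
    """A trailing-wildcard search is now a plain exact lookup of the prefix."""
    return sorted(index.get(prefix, set()))


docs = {1: "terminal output", 2: "long term plan", 3: "other text"}
index = build_index(docs)
print(prefix_search(index, "term"))  # documents with a token starting with "term"
```

The price is index size: every token is stored several times, once per prefix length, which is why this usually lives in its own field next to the field used for exact matches.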

Does this mean that if I want any combination of the mentioned search
options, I have to create a new field for every combination? How do I
then search for a combination, like "term1 (case-sensitive) AND term2
(case-insensitive)"?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

brian_yoder · August 6, 2013, 2:19pm

A multifield mapping would be best. One copy of the data can be indexed
several different ways. So for example, your query would specify:

myfield.case_insensitive:foo
myfield.case_sensitive:Foo

And so on. Of course, you can choose shorter names. And one of the mappings
can be the default, so for example you might want plain myfield to be the
default and have it handle case-insensitive queries.
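
A rough sketch of what such a mapping could look like, written as a Python dict that would be sent as the body of an index-creation request. This assumes the "fields" syntax and typeless mappings of recent Elasticsearch versions (the 2013-era equivalent was the multi_field type), and the analyzer names are invented for this example.

```python
import json

# Two custom analyzers: same tokenizer, one of them lowercases the tokens.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "case_sensitive_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "standard",
                },
                "case_insensitive_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            # Plain "myfield" is the default and is case-insensitive.
            "myfield": {
                "type": "text",
                "analyzer": "case_insensitive_analyzer",
                "fields": {
                    "case_sensitive": {
                        "type": "text",
                        "analyzer": "case_sensitive_analyzer",
                    },
                    "case_insensitive": {
                        "type": "text",
                        "analyzer": "case_insensitive_analyzer",
                    },
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The document itself still carries only one myfield value; the sub-fields are just additional ways of indexing that same value.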

(I tried to quickly look this up on the on-line guide, then fell back to my
own docs. That book will be very nice!!!)
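
For the combination asked about in the original question ("term1 (case-sensitive) AND term2 (case-insensitive)"), one possible sketch is a bool query with one must clause per sub-field. It assumes the sub-field names myfield.case_sensitive and myfield.case_insensitive from the example above; the helper name combined_query is made up for this illustration.

```python
import json


def combined_query(case_sensitive_term, case_insensitive_term, field="myfield"):
    """Build a search body that requires both terms, each on its own sub-field."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Matched against the tokens indexed with their case kept.
                    {"match": {f"{field}.case_sensitive": case_sensitive_term}},
                    # Matched against the lowercased tokens.
                    {"match": {f"{field}.case_insensitive": case_insensitive_term}},
                ]
            }
        }
    }


print(json.dumps(combined_query("Foo", "bar"), indent=2))
```

In query-string form the same idea would read myfield.case_sensitive:Foo AND myfield.case_insensitive:bar. So there is no need for one field per combination; the combination happens in the query.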

Stefan_Pi_3 · August 8, 2013, 7:30am

But with this approach, how do I search for this phrase: "term1 term2*"?
The first term (term1) should match as it is (exact match), whereas the
second term should match any term starting with term2 (wildcard match).
In addition, the second term must directly follow the first in the original
text (phrase match). I don't see how this can be arranged with multi
fields. Any ideas?

Thank you!

brian_yoder · August 8, 2013, 3:45pm

Stefan,

You are running into a limitation of Lucene's query syntax: it does not
permit wildcards inside phrases.

In the long term, I hope to get the chance to teach Lucene how to do this
and more.

In the meantime, you could issue some form of AND or AND-like query (for
example, BoolQuery with must clauses for the individual terms). Then you
could post-process the response documents by implementing a recursive
algorithm that would be able to match phrases and wildcards. I once
implemented this in C++ as a post-query filter, and it wouldn't be too
difficult to reincarnate it in Java.
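
As a rough idea of what such a post-query filter could look like, here is a small Python sketch (not the C++ implementation mentioned above). It assumes the candidate documents have already been fetched by the AND-like query, that tokens are split on whitespace, and that each query term is a (pattern, case_sensitive) pair in which * is the wildcard. All names are invented for this illustration.

```python
from fnmatch import fnmatchcase


def term_matches(token, pattern, case_sensitive):
    """Compare one document token with one query term.

    fnmatchcase also treats '?' and '[...]' as special, not only '*'.
    """
    if not case_sensitive:
        token, pattern = token.lower(), pattern.lower()
    return fnmatchcase(token, pattern)


def phrase_matches(doc_tokens, query_terms):
    """True if query_terms match consecutive tokens anywhere in doc_tokens."""

    def match_from(doc_pos, term_pos):
        if term_pos == len(query_terms):
            return True  # every query term has been matched in sequence
        if doc_pos == len(doc_tokens):
            return False  # ran out of document tokens
        pattern, case_sensitive = query_terms[term_pos]
        if not term_matches(doc_tokens[doc_pos], pattern, case_sensitive):
            return False
        return match_from(doc_pos + 1, term_pos + 1)

    last_start = len(doc_tokens) - len(query_terms)
    return any(match_from(start, 0) for start in range(last_start + 1))


def post_filter(candidates, query_terms):
    """Keep only the candidate texts that really contain the phrase."""
    return [text for text in candidates if phrase_matches(text.split(), query_terms)]


candidates = ["see term1 term2abc here", "term2abc term1", "Term1 term2x"]
query = [("term1", True), ("term2*", False)]  # exact term1, then wildcard term2*
print(post_filter(candidates, query))
```

Only the first candidate survives here: the second has the terms in the wrong order, and the third fails the case-sensitive match on term1.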

But it would be faster if the Lucene index supported word position
natively. Then it could easily match phrases within the index, match
wildcard terms in phrases within the index, and use leapfrog logic to make
it all run blindingly fast. {"dream": "off" }

Brian
