Analyzing wildcard queries

hi

i have text/email addresses indexed with the standard analyzer.

e.g. "marco.kamm@brain.net" results in two tokens in the index:

[marco.kamm] and [brain.net]

i want to search using query_string query and wildcards like:

{
  "fields": ["contact_email"],
  "query": {
    "query_string": {
      "query": "(contact_email:(marco.*@brain.net))",
      "default_operator": "and",
      "analyze_wildcard": true
    }
  }
}

from my past work with lucene i know that wildcard queries are kind of
problematic because they're not analyzed by default.
(to work around this behaviour i wrote a custom parser that prepares the
query string according to the specific field's analyzer before passing it
to the lucene query parser)

at first when i noticed the analyze_wildcard parameter/option i thought
great/cool! i no longer need my "custom magic parser ,-)", elasticsearch
provides built-in support for my problems ...

when testing the "analyze_wildcard" behaviour with "pure" prefix queries
like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same
thing i tried to achieve with my custom "pre-parser". the query was
"transformed" to sth. like "contact_email:marco.kamm OR
contact_email:brain*", which perfectly matches what's in the index ...

but unfortunately, testing with "real" wildcard queries like the above
"marco.*@brain.net" gives me a query that won't find anything in my
situation, because it is turned into "contact_email:marco*brain.net" and
there's no single(!) token in my index that will match (although it gets
analyzed). to find some results, the query would rather have to be turned
into sth. like "contact_email:marco* AND contact_email:*brain.net" or
"contact_email:marco* AND contact_email:brain.net" (if the user searches
for "marco.*.net") ...
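
The mismatch can be reproduced outside Elasticsearch with a minimal Python sketch. It assumes the two tokens quoted in the post and uses glob-style matching from the standard library's fnmatch module as a rough stand-in for Lucene wildcard semantics; the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Tokens the post reports for "marco.kamm@brain.net".
index_tokens = ["marco.kamm", "brain.net"]

def any_token_matches(pattern, tokens):
    """True if at least one indexed token matches the wildcard pattern."""
    return any(fnmatchcase(token, pattern) for token in tokens)

def all_clauses_match(patterns, tokens):
    """AND semantics: every pattern must match at least one token."""
    return all(any_token_matches(p, tokens) for p in patterns)

# One concatenated wildcard has to match a single token, which never happens.
single = any_token_matches("marco*brain.net", index_tokens)

# Split into per-token clauses, each clause finds its own token.
split = all_clauses_match(["marco*", "*brain.net"], index_tokens)

print(single, split)
```

The concatenated pattern matches neither token, while the split form matches both, which is the behaviour the post asks for.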

by looking at the source code of
org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually
started to dive into the source code by chasing down the "rather small"
issue with the hardcoded boolean clause OR operator reported here:
https://github.com/elasticsearch/elasticsearch/issues/2183) i realized that
there are two different methods for analyzing prefix and pure wildcard
queries (getPossiblyAnalyzedPrefixQuery and
getPossiblyAnalyzedWildcardQuery; i first expected these cases to be
handled by the same code). that's why i'm getting perfect results for
prefix queries and, sadly, non-working ones for pure wildcard queries ...

i started to experiment with the getPossiblyAnalyzedWildcardQuery method by
rewriting it to work more like the getPossiblyAnalyzedPrefixQuery method:
instead of generating only a single WildcardQuery object from the analyzed
string, it builds a boolean query containing several WildcardQuery objects
(splitting on */?) ...

my first tests showed that this would work quite well! ...

now my questions:

what do you think about this "approach"?

do you see any serious drawbacks besides performance?
i know that using even more wildcards will drastically reduce search
performance, but isn't it better to finally serve some results after quite
a long time than to find nothing at all?

(i also know that lucene is not built/optimized for wildcard queries and
that some cases could be resolved using different analyzers (ngram,
reverse), multiple fields etc. but users are used to wildcards, and there
could be use cases where such wildcard queries make sense, or where it's
not practicable to use keyword analyzers, which don't suffer from such
problems, e.g. for longer text)

do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method
(although the docs state that this method works on a best-effort basis)?

(btw. do you also plan to fix the OR operator issue? it could be rather
simple: just use the specified parameter)

if my approach is legit, and given that i don't like having to modify the
elasticsearch "core" code and rebuild/adapt it with every new release,
how/where else could i implement such an extension? do i have to write a
custom query parser (maybe extending MapperQueryParser) and build my own
plugin / rest endpoint?

(i recently found out that there's also a lucene class called
AnalyzingQueryParser. maybe i should have used this one instead of writing
my own magic parser. is/could this be used somehow in elasticsearch?)

is there a possibility to / should i write a feature request for even more
best effort on analyzing wildcard queries? PS: i know the wildcard handling
issue could be a pain in the a**, and maybe it can only be solved on a
best-effort basis. but i'm somehow forced to mess around with this because
i have to (want to!) port my old lucene stuff to elasticsearch. (apart from
this issue i think elasticsearch is a great product and i like working with
it. this problem lies in the nature of inverted indices and wildcards resp.
analyzers)

sorry for the long, maybe confusing mail, but i need your expert
thoughts/advice on this wildcard issue

thank you
regards marco

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/62b204b3-fef6-4328-abaa-6b1eae99d1e0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Wildcard and analysis surely need improvements.

There is also weak support for wildcards in phrases. Elasticsearch does not
support ComplexPhraseQueryParser right now:

http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html

but Solr does

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

There are many parsers in Lucene, and more parsers with improvements are on
the way, like this one

https://issues.apache.org/jira/browse/LUCENE-5205

You can fork Elasticsearch source, add your improvements to a branch, so
that we can look at the code.

Do not forget simple_query_string, I have offered a similar patch for
adding prefix analysis to wildcard queries for simple_query_string (porting
the best effort approach from query_string)

Cheers,

Jörg

hi jörg

thank you for your quick response!

glad to hear that you agree that wildcard analysis could be further
improved. (prefix support is already great!)
i already started to look around for other solutions like writing a plugin
to use a custom query parser or sth., but, provided i don't misinterpret
your answer, improving the getPossiblyAnalyzedWildcardQuery method does not
sound completely absurd to you, i.e. it is not the wrong place/approach.
(you could also have told me that i need to write a plugin or somehow
register some kind of query parser subclass, or given some other reason why
this method is written the way it is)

so for the moment i will stick with "my" improved
getPossiblyAnalyzedWildcardQuery method and do further testing with more
data, larger indices etc. to see how it performs (as i initially
mentioned, i need to "generate" even more wildcards, including leading
ones, to produce the desired results/matches) ...

as soon as i'm convinced of the "improvement" i'll clean up the code and
try to do a fork so you can have a look at it
(PS: i need to familiarize myself a bit more with git first, since i'm still
one of the old-school svn guys ;-), but i think somehow i will manage to do
a fork / commit) ...

i would like to help further improve such a great software product as
elasticsearch

cheers marco

Regarding github, you can follow the help at

but if you feel more comfortable, you can also just post a diff/patch
somewhere (preferably against HEAD) with your changes/additions.

This would be enough at least for me to have a first look.

Jörg

hi jörg

just wanted to tell you that i will not fork/commit my "improvement" on
wildcard analysis because i'm no longer 100% convinced that it really is an
improvement, or that it can be used in general ...

after rethinking, i must admit that i was probably too focused on my
concrete issue with email addresses using the standard analyzer,
e.g. "marco.kamm@brain.net" analyzed into the tokens [marco.kamm] [brain.net].
the original idea behind using the standard analyzer was that users will
find sth. when searching for "brain.net" or "marco.kamm" without having to
use any wildcards!
(the old lucene standard analyzer also split on '.' characters, so even
"marco" or "brain" could be found)

somehow i thought it would also make sense to search for e.g.
"marco.*@brain.net" or "marco.kamm@*.net"

my first improvement approach was based on the existing code, but instead of
concatenating all the analyzed substring parts into a single wildcard query
i tried to build a boolean query containing the individual analyzed parts
as either prefix or wildcard queries ...
e.g.
"marco.*@brain.net" --> "marco*" AND "*brain.net"
"marco.kamm@*.net" --> "marco.kamm*" AND "*net"
the first query can only be a prefix query (when not preceded by a
wildcard char) and the last one can be a "postfix" query;
everything in between was surrounded by '*'...'*'
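
A minimal Python sketch of this first approach, under stated assumptions: toy_analyze is a hypothetical stand-in for the field analyzer (it is not the Lucene standard analyzer), every run of '*' or '?' is treated as a single '*', and the rule for where a clause gets a leading or trailing wildcard is one possible reading of the description above, not the poster's actual code.

```python
import re

def toy_analyze(text):
    """Hypothetical stand-in for the field analyzer: lowercase, split on '@'
    and whitespace, strip dots left dangling at token edges."""
    tokens = (t.strip(".") for t in re.split(r"[@\s]+", text.lower()))
    return [t for t in tokens if t]

def split_wildcard_query(pattern, analyze=toy_analyze):
    """Return the clauses of an AND query built from the analyzed parts."""
    parts = [p for p in re.split(r"[*?]+", pattern) if p]
    starts_wild = pattern[:1] in ("*", "?")
    ends_wild = pattern[-1:] in ("*", "?")
    clauses = []
    for i, part in enumerate(parts):
        tokens = analyze(part)
        for j, token in enumerate(tokens):
            # A token gets a wildcard on a side only where the original
            # pattern had one next to this part.
            left = j == 0 and (i > 0 or starts_wild)
            right = j == len(tokens) - 1 and (i < len(parts) - 1 or ends_wild)
            clauses.append(("*" if left else "") + token + ("*" if right else ""))
    return clauses

print(split_wildcard_query("marco.*@brain.net"))
print(split_wildcard_query("marco.kamm@*.net"))
```

With this toy analyzer the two inputs yield the clause lists ['marco*', '*brain.net'] and ['marco.kamm*', '*net']; a part in the middle of a pattern such as "a*b*c" becomes '*b*'.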

another (optimized) approach is based on the following technique:
generate a random letter sequence that is not present in the search term,
replace the wildcards with this sequence and feed the result to the analyzer.
this way, if the analyzer produces more than one token out of a single
wildcard input, you can be sure that the original input would also be split
into several terms and you need more than one query object ...

after analyzing, process the resulting tokens one by one and combine them
into a boolean AND query. for each token, undo the wildcard replacement
and check for occurrences of wildcard characters: if a token contains no
wildcards at all, use a TermQuery; if the token only contains a wildcard
char at the end, use a PrefixQuery; otherwise use a WildcardQuery ...

e.g.
"marco.*@brain.net" --> "marco.{randomLetterSequence}@brain.net" -->
[marco.{randomLetterSequence}] [brain.net] --> "marco.*" AND "brain.net"
"marco.kamm@*.net" --> "marco.kamm@{randomLetterSequence}.net" -->
[marco.kamm] [{randomLetterSequence}.net] --> "marco.kamm" AND "*.net"
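
The placeholder technique can be sketched in Python as follows. This is an illustration under assumptions, not the poster's code: toy_analyze is a hypothetical analyzer that only lowercases and splits on '@' and whitespace, all function names are made up, and a separate placeholder is used for '?' so both wildcard kinds can be restored. Only a single trailing '*' is classified as a prefix query.

```python
import random
import re
import string

def toy_analyze(text):
    """Hypothetical stand-in analyzer: lowercase, split on '@' and whitespace."""
    return [t for t in re.split(r"[@\s]+", text.lower()) if t]

def make_placeholder(text, length=8):
    """Pick a random lowercase letter sequence that does not occur in text."""
    while True:
        candidate = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in text.lower():
            return candidate

def analyze_wildcard_query(pattern, analyze=toy_analyze):
    """Return (query_kind, text) clauses that are meant to be AND-ed together."""
    # Mask the wildcards with letter sequences the analyzer will not eat.
    star = make_placeholder(pattern)
    masked = pattern.replace("*", star)
    qmark = make_placeholder(masked)
    masked = masked.replace("?", qmark)

    clauses = []
    for token in analyze(masked):
        # Undo the replacement, then classify by where wildcards remain.
        restored = token.replace(qmark, "?").replace(star, "*")
        if "*" not in restored and "?" not in restored:
            kind = "term"
        elif restored.endswith("*") and not any(c in restored[:-1] for c in "*?"):
            kind = "prefix"
        else:
            kind = "wildcard"
        clauses.append((kind, restored))
    return clauses

print(analyze_wildcard_query("marco.*@brain.net"))
print(analyze_wildcard_query("marco.kamm@*.net"))
```

Because the placeholder is made of plain lowercase letters, it survives the lowercasing step of the toy analyzer; an analyzer that stems or otherwise rewrites letter sequences could break that assumption.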

these approaches could work for my cases (at least they produce some
results where the original code didn't find anything, although the results
may be inaccurate, but this lies in the nature of AND combinations: e.g.
"marco.*@brain.net" transformed into "marco.*" AND "brain.net" could also
find brain.net@marco.org etc.)

but i think for most cases (where the queried field uses an analyzer
that doesn't split terms into several tokens, e.g. the keyword analyzer)
the existing code already makes the best effort that can be made in a
generic way (without knowing what the analyzer does with certain
characters)

maybe you can use sth. from my 2nd approach: testing the analyzer's
behaviour by replacing the wildcards with sth. that doesn't get eaten up, to
see whether the input is split or not
(i think a sequence of plain ascii letters could be a way, but i'm not sure
whether this could serve as a general solution, e.g. for japanese analyzers.
to me a sequence of ascii letters seems like a kind of lowest common
denominator).

for the moment we're trying to live with the current best-effort approach,
maybe analyzing some fields twice, once with a standard analyzer or sth. and
additionally with a keyword analyzer, and directing pure wildcard queries to
the keyword field. or maybe we're going to split email addresses into
separate username and domain fields etc.
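
The last idea, splitting addresses before indexing, could look roughly like this in Python. The field names contact_email_user and contact_email_domain are hypothetical and only illustrate the shape of the document that would be sent to the index.

```python
def split_email(address):
    """Split an address at its last '@' into user and domain parts.

    The field names below are hypothetical, chosen only for illustration.
    """
    user, sep, domain = address.rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a usable email address: {address!r}")
    return {
        "contact_email": address,
        "contact_email_user": user,
        "contact_email_domain": domain,
    }

doc = split_email("marco.kamm@brain.net")
print(doc)
```

With one value per field, a wildcard such as marco.* can be directed at the user field and *.net at the domain field, so no single pattern has to span two tokens.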

thank you anyway for your time

cheers marco
