Wildcard Query within a Span


(Michael Sander) #1

Hi,

Is it possible to construct an Elasticsearch query (or filter) that detects
whether two words with wildcards are within a certain distance of each
other?

For example, I would like a query that detects whether pret* and ug* are
within five words of each other. Such a query should match "She is pretty
and he is ugly."
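To make the intended semantics concrete, here is a minimal in-memory sketch in Python. It is not how Elasticsearch or Lucene evaluates anything; the function names, the regex tokenizer, and the reading of "within five words" as a position difference of at most five are all assumptions made for illustration.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, pattern):
    """Return the positions of all tokens that match a shell-style wildcard pattern."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]


def wildcards_within(text, first, second, max_distance):
    """True if a token matching `first` and a different token matching `second`
    are at most `max_distance` positions apart, in either order."""
    tokens = tokenize(text)
    return any(
        a != b and abs(a - b) <= max_distance
        for a in positions(tokens, first)
        for b in positions(tokens, second)
    )


# "pretty" is at position 2 and "ugly" at position 6, so they are 4 apart.
print(wildcards_within("She is pretty and he is ugly.", "pret*", "ug*", 5))
```

A real span query would do the same comparison against indexed term positions, after expanding each wildcard into the matching terms.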

I think I would need to use the span_near query, but span_near only accepts
a series of span_term queries as clauses, and span_term doesn't appear to
allow wildcards.

Is it possible to do this with elasticsearch? If not, is this possible with
Lucene directly?

FYI, I have an SO question open here
http://stackoverflow.com/questions/13258997/elasticsearch-query-wildcard-or-stemming-within-a-span-i-e-proximity-query

--


(Ivan Brusic) #2

ElasticSearch does not support Lucene's SpanRegexQuery or
SpanMultiTermQueryWrapper. I'm not sure how difficult it would be to expose them.

I believe there are some changes related to them in Lucene 4.0, so if
anything was written for ES 0.20 and prior, it would need to be converted.

--
Ivan

--


(Chris Male) #3

It is definitely possible to build this sort of complex Query using Lucene.
I would recommend rolling your own QueryParser, perhaps based on one of the
existing ones, that creates the kind of Query you're looking for.

--


(Michael Sander) #4

Exposing SpanMultiTermQueryWrapper seems like the way to go. QueryParser
may be more powerful, but I doubt it could be exposed through a generic
interface. Any idea if this is possible with an elasticsearch plugin?

--


(Ivan Brusic) #5

I have never done so, so I could be horribly wrong, but I assume all that
needs to be done is to create a new
org.elasticsearch.index.query.QueryParser that outputs the appropriate
Lucene query and bind it inside IndicesQueriesModule.

--


(simonw-2) #6

Hi Michael,

Queries of this kind are possible, but do you really want to do this? Take a
step back and think about how we would calculate relevance for this. I
don't think you can expect a reasonable relevance score for such a query,
nor reasonable performance. The fact that Lucene allows these kinds of
queries is scary enough. :) I'd really like to hear what you are trying
to achieve, and maybe we can find a better way to do this than multi-term
spans. What is the use case for queries like "pret* and ug*"? Who types
that in? I mean, I could imagine there are use cases like this (lawyers do
weird things with search engines in the patent space...), but maybe you can
elaborate and we can think about a better solution?

simon

--


(Michael Sander) #7

Hi Simon,

Yes I really want to do this and your guess is correct: I am working on a
legal research tool. Lawyers use surprisingly sophisticated queries to
research law. For example, a lawyer researching employment discrimination
lawsuits in New York may use the following query:

(employ* within/5 discrimi*) within/20 (black or latino or hispanic or
(african within/3 american)) and "New York"
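As a rough illustration of how such a query might be written once wildcards can be used inside spans, here is a Python sketch that builds a query DSL body. It assumes a later Elasticsearch release that documents a `span_multi` query (not available at the time of this thread). The field name `body` and the helper names are hypothetical. Mapping `within/N` to a `span_near` slop of N, and treating `and "New York"` as a top-level AND, are my own reading of the example. Slop counts intervening positions, so the distances are only approximately equivalent.

```python
import json

FIELD = "body"  # hypothetical field name; the thread does not name one


def wild(pattern):
    """Wrap a wildcard query so it can be used as a span clause (assumes span_multi exists)."""
    return {"span_multi": {"match": {"wildcard": {FIELD: {"value": pattern}}}}}


def term(word):
    """Exact-term span clause; the word must match the indexed (analyzed) form."""
    return {"span_term": {FIELD: word}}


def within(slop, *clauses):
    """Unordered proximity between span clauses."""
    return {"span_near": {"clauses": list(clauses), "slop": slop, "in_order": False}}


def any_of(*clauses):
    """Match any one of the given span clauses."""
    return {"span_or": {"clauses": list(clauses)}}


query = {
    "bool": {
        "must": [
            within(
                20,
                within(5, wild("employ*"), wild("discrimi*")),
                any_of(
                    term("black"),
                    term("latino"),
                    term("hispanic"),
                    within(3, term("african"), term("american")),
                ),
            ),
            {"match_phrase": {FIELD: "New York"}},
        ]
    }
}

print(json.dumps({"query": query}, indent=2))
```

The nesting mirrors the parentheses in the lawyer's query, which is why small composable helpers are used rather than one large literal.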

It seems complex, but searches like this occur all the time and such
functionality is expected. It's one of the reasons Google Scholar is not
terribly popular with attorneys. Speed is important but not of extreme
importance. A two or three second wait-time is not a deal breaker, but it
definitely needs to be under ten. To make things run faster, I could limit
wildcard queries to require at least four or five letters.
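The minimum-length restriction could be enforced before a query is ever sent. The following Python sketch is one possible check, not part of any Elasticsearch API; the function name and the default of four letters are assumptions taken from the "four or five letters" idea above.

```python
WILDCARDS = "*?"


def wildcard_allowed(term, min_letters=4):
    """True if the term has no wildcard, or has at least `min_letters`
    literal characters before its first wildcard."""
    first = next((i for i, ch in enumerate(term) if ch in WILDCARDS), None)
    if first is None:
        return True  # plain term, nothing to expand
    return first >= min_letters


for candidate in ["employ*", "discrimi*", "pret*", "ug*"]:
    print(candidate, wildcard_allowed(candidate))
```

With a four-letter minimum, `ug*` from the original example would be rejected, so the threshold is a trade-off between speed and what users are allowed to type.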

I will look into creating the plugin; however, it does not look like a
simple task.

--


(simonw-2) #8

Hey Michael,

my first reaction: "Oh boy! - this looks very familiar :)"
second reaction: "I knew it, it must be the legal space :)"

Yeah man, I know what you are talking about, been there, done that... From a
query perspective this doesn't seem too hard, but it's certainly work to do.
If you have questions, please feel free to ask on the list. In the meantime
I will see if it makes sense for us to add that to ES.

simon

--


(Michael Sander) #9

Exposing Lucene's SpanMultiTermQueryWrapper looks like the easiest way to
support sophisticated queries.

I spent a few hours looking over the elasticsearch code and plugin
architecture and I think it's possible to implement this in just a
few lines of code. Unfortunately, I don't think the plugin system offers a
way to install new query types as plugins. All of the QueryBuilders are
hard-coded in QueryBuilders.java.

Is there a way to add new query types as a plugin? If not, maybe I'll just
fork.

--


(Michael Sander) #10

FYI, I opened issue 2400 on the matter:
https://github.com/elasticsearch/elasticsearch/issues/2400

--


(Ivan Brusic) #11

QueryBuilders methods are simple wrappers around the true constructors.
They are nice to have, but not necessary.

--


(Michael Sander) #12

I'm not sure I understand. Are you saying that it is possible to build a
plugin that implements SpanMultiTermQueryWrapper?

--


(Ivan Brusic) #13

What I am saying is that the inability to modify QueryBuilders is not a
stopping point. All it contains are helper methods, which are nice to have,
but not a show-stopper. Are there other blocking points? Perhaps.
Regardless, those queries should be supported by ElasticSearch since they
are native to Lucene.

Maybe I will try to add those myself. Looking at the commits, the ES team
is working hard on Lucene 4.0. IIRC, span queries changed in Lucene 4, so
perhaps they should finish the conversion first before working on new
features that would then need to be ported.

--
Ivan

On Tue, Dec 4, 2012 at 11:07 PM, Michael Sander michael.sander@gmail.comwrote:

I'm not sure I understand. Are you saying that it is possible to build a
plugin that implements SpanMultiTermQueryWrapper?

On Wed, Dec 5, 2012 at 1:41 AM, Ivan Brusic ivan@brusic.com wrote:

QueryBuilders methods are simple wrappers around the true constructors.
They are nice to have, but not necessary.

On Tue, Dec 4, 2012 at 6:22 PM, Michael Sander michael.sander@gmail.comwrote:

FYI, I opened issue 2400 on the matter:
https://github.com/elasticsearch/elasticsearch/issues/2400

On Monday, November 12, 2012 8:47:58 AM UTC-5, Michael Sander wrote:

Exposing lucene's **SpanMultiTermQueryWrapper **looks like the easiest
way to support sophisticated queries.

I spent a few hours looking over the elasticsearch code and plugin
architecture and I think it's possible to implement this in just a
few lines of code. Unfortunately, I don't think the plugin system offers a
way to install new query types as plugins. All of the QueryBuilders are
hard-coded in QueryBuilders.java.

Is there a way to add new query types as a plugin? If not, maybe I'll
just fork.

On Mon, Nov 12, 2012 at 3:56 AM, simonw wrote:

On Friday, November 9, 2012 11:46:35 PM UTC+1, Michael Sander wrote:

Hi Simon,

Yes I really want to do this and your guess is correct: I am working
on a legal research tool. Lawyers use surprisingly sophisticated queries
to research law. For example, a lawyer researching employment
discrimination lawsuits in New York may use the following query:

(employ* within/5 discrimi*) within/20 (black or latino or hispanic
or (african within/3 american)) and "New York"

It seems complex, but searches like this occur all the time and such
functionality is expected. It's one of the reasons Google scholar is not
terribly popular with attorneys. Speed is important but not of extreme
importance. A two or three second wait-time is not a deal breaker, but it
definitely needs to be under ten. To make things run faster, I could limit
wildcard queries to require at least four or five letters.

I will look into creating the plugin, however it does not look like a
simple task.

Hey Michael,

my first reaction: "Oh boy! - this looks very familiar :)"
second reaction: "I knew it, it must be the legal space :)"

Yeah man I know what you are talking about, been there done that....
from a query perspective this doesn't seem too hard but it's certainly work
to do. If you have questions please feel free to ask on the list. I will
see in the meanwhile if it makes sense for us to add that to ES.

simon

On Friday, November 9, 2012 3:30:29 AM UTC-5, simonw wrote:

Hi Michael,

these kinds of queries are possible, but do you really want to do this?
Take a step back and think about how we would calculate relevance for this.
I don't think you can expect a reasonable relevance score for such a query,
nor reasonable performance. The fact that Lucene allows these kinds of
queries is scary enough. :) I'd really like to hear what you are trying
to achieve, and maybe we can find a better way to do this than multi-term
spans. What is the use case for queries like "pret* and ug*"? Who types
that in? I mean, I could imagine there are use cases like this (lawyers do
weird things with search engines in the patent space...) but maybe you can
elaborate and we can think about a better solution?

simon


(Christophe V.) #14

Hi Michael,

I have the same need today, for searching patent documents.
Did you succeed in using the span_multi query on Elasticsearch, or did you
migrate to another Lucene-based search engine?

thanks in advance for your feedback

regards

--
Christophe


(Michael Sander) #15

Hi Christophe,

Yes, I got it working using span_multi. At the time I originally sent this
message, span_multi did not exist. However, it has since been added and
works great.

I am curious... how are you using Elasticsearch in the patent space? My
site also searches legal documents: www.docketalarm.com/search

Best,

Michael Sander
michael.sander@gmail.com
607-227-9859


(David Janssen) #16

Hi Michael,

I have the same requirement you had: to 'convert' sophisticated queries
into the Elasticsearch query DSL.

Would it be possible to get a JSON sample of span_multi usage with complex
proximity?
For example, what would be the right syntax for this request (is it
possible to express that using ES?):

(toto and tata) within/20 (tutu or titi)

Best regards
David


(Michael Sander-2) #17

Hi David,

This is from one of my tests:

Input:
    'ferm! w/5 outil w/5 dispositif'
Output:
    {
        'span_near': {
            'clauses': [
                {'span_multi': {'match': {'prefix': {'text': 'ferm'}}}},
                {'span_near': {
                    'clauses': [
                        {'span_term': {'text': 'outil'}},
                        {'span_term': {'text': 'dispositif'}}
                    ],
                    'collect_payloads': False,
                    'in_order': False,
                    'slop': 4
                }}
            ],
            'collect_payloads': False,
            'in_order': False,
            'slop': 4
        }
    }

Michael Sander
mes65@cornell.edu
607-227-9859
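
For anyone landing here with the original question (pret* within five words of ug*), here is a minimal sketch in the same Python-dict style as the sample above. The helper names `span_prefix` and `within` and the field name `text` are made up for illustration. The mapping from w/N to a slop of N - 1 is only inferred from the samples in this thread (w/5 produced slop 4), so treat it as an assumption rather than a rule. `collect_payloads` is left out for brevity.

```python
# Hypothetical helpers (not code from the thread) that build span query dicts.

def span_prefix(field, prefix):
    """Wrap a prefix query in span_multi so it can sit inside span_near."""
    return {"span_multi": {"match": {"prefix": {field: prefix}}}}


def within(distance, *clauses):
    """Unordered proximity; assumes w/N maps to slop N - 1 as in the samples."""
    return {
        "span_near": {
            "clauses": list(clauses),
            "slop": distance - 1,
            "in_order": False,
        }
    }


# pret* within five words of ug*, on an assumed field named "text"
query = within(5, span_prefix("text", "pret"), span_prefix("text", "ug"))
```

The resulting dict goes under the "query" key of a search request body, the same way any other query does.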


(Michael Sander-2) #18

Here's another:

Input:
    foo w/3 (biz and bar)

Output:
    {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "text" : "foo" } },
                # "and" is expressed as an unordered span_near with a very large slop
                { "span_near" : {
                    "clauses" : [
                        { "span_term" : { "text" : "biz" } },
                        { "span_term" : { "text" : "bar" } },
                    ],
                    "slop" : int(1e6),
                    "in_order" : False,
                    "collect_payloads" : False
                } },
            ],
            "slop" : 2,
            "in_order" : False,
            "collect_payloads" : False
        }
    }

Michael Sander
mes65@cornell.edu
607-227-9859
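
David's example, (toto and tata) within/20 (tutu or titi), could probably be built the same way. The sketch below is not from Michael's code: the helper names `term`, `near`, and `any_of` and the field name `text` are hypothetical. It reuses the two conventions visible in the samples (a very large slop for "and", and slop N - 1 for within/N) and assumes Elasticsearch's span_or query for the "or" group.

```python
# Hypothetical helpers, not code from the thread.
AND_SLOP = int(1e6)  # same "effectively unlimited" slop the sample uses for "and"


def term(field, word):
    """Exact-term span clause."""
    return {"span_term": {field: word}}


def near(slop, *clauses):
    """Unordered span_near over the given span clauses."""
    return {"span_near": {"clauses": list(clauses), "slop": slop, "in_order": False}}


def any_of(*clauses):
    """span_or matches a span from any one of its clauses."""
    return {"span_or": {"clauses": list(clauses)}}


# (toto and tata) within/20 (tutu or titi), on an assumed field named "text"
query = near(
    19,  # within/20 -> slop 19, following the N - 1 pattern in the samples
    near(AND_SLOP, term("text", "toto"), term("text", "tata")),
    any_of(term("text", "tutu"), term("text", "titi")),
)
```

One thing to check against real data: the inner "and" produces a span stretching from toto to tata, and the outer slop is measured from that whole span, which may or may not be the proximity semantics you expect.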


(Michael Sander-2) #19

Note that these are Python dictionaries, not JSON, but it's very similar.

Michael Sander
mes65@cornell.edu
607-227-9859
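
If you need actual JSON, Python's standard `json` module will convert dictionaries like these. A small sketch with a trimmed-down query (the field name `text` is just the one used in the samples): `False` comes out as `false`, `int(1e6)` is evaluated to `1000000` before serialization, and the trailing commas that Python tolerates do not appear in the output.

```python
import json

# A trimmed-down query in the same Python-dict form as the samples.
query = {
    "span_near": {
        "clauses": [
            {"span_term": {"text": "biz"}},
            {"span_term": {"text": "bar"}},
        ],
        "slop": int(1e6),
        "in_order": False,
    }
}

# Search requests carry the query under a top-level "query" key.
body = json.dumps({"query": query})
print(body)
```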


(David Janssen) #20

Thank you so much Michael,
These samples are really helpful.

David

On Thursday, October 24, 2013 at 16:53:13 UTC+2, Michael Sander wrote:

Note that these are python dictionaries, not JSON, but it's very similar,
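
Since the samples are Python dictionaries rather than JSON, here is a minimal sketch of how one could be turned into a JSON request body with the standard json module. The query below is a cut-down illustration (the field name "text" follows the samples, the rest is an assumption), not one of the original test cases.

```python
import json

# Illustrative query only: "foo" near "bar" on a field assumed to be named "text".
query = {
    "span_near": {
        "clauses": [
            {"span_term": {"text": "foo"}},
            {"span_term": {"text": "bar"}},
        ],
        "slop": 2,
        "in_order": False,
    }
}

# json.dumps writes Python's False as JSON's false, so the dictionary
# can be used as a request body without editing it by hand.
body = json.dumps({"query": query}, indent=2)
print(body)
```

Values such as int(1e6) are ordinary Python integers, so they serialize as plain numbers too.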

Michael Sander
me...@cornell.edu
607-227-9859

On Thu, Oct 24, 2013 at 10:52 AM, Michael Sander <me...@cornell.edu> wrote:

Here's another.

Input:
foo w/3 (biz and bar)

Output:
{
    "span_near": {
        "clauses": [
            {"span_term": {"text": "foo"}},
            # "biz and bar": both terms, in any order, with an effectively unlimited slop
            {"span_near": {
                "clauses": [
                    {"span_term": {"text": "biz"}},
                    {"span_term": {"text": "bar"}},
                ],
                "slop": int(1e6),
                "in_order": False,
                "collect_payloads": False
            }},
        ],
        # w/3 is expressed as slop 2
        "slop": 2,
        "in_order": False,
        "collect_payloads": False
    }
}

Michael Sander
me...@cornell.edu
607-227-9859

On Thu, Oct 24, 2013 at 10:51 AM, Michael Sander <me...@cornell.edu> wrote:

Hi David,

This is from one of my tests:

Input:
'ferm! w/5 outil w/5 dispositif'

Output:
{
    'span_near': {
        'clauses': [
            # 'ferm!' is a prefix search, so it is wrapped in span_multi
            {'span_multi': {'match': {'prefix': {'text': 'ferm'}}}},
            {'span_near': {
                'clauses': [
                    {'span_term': {'text': 'outil'}},
                    {'span_term': {'text': 'dispositif'}}
                ],
                'collect_payloads': False,
                'in_order': False,
                'slop': 4
            }}
        ],
        'collect_payloads': False,
        'in_order': False,
        # w/5 is expressed as slop 4
        'slop': 4
    }
}

Michael Sander
me...@cornell.edu
607-227-9859

On Thu, Oct 24, 2013 at 9:38 AM, <janss...@gmail.com> wrote:

Hi Michael,

I have the same requirement you had: 'converting' sophisticated queries
into the Elasticsearch query DSL.

Would it be possible to get a JSON sample of span_multi usage with
complex proximity?
For example, what would be the right syntax for this request (is it
possible to express it in ES?):

(toto and tata) within/20 (tutu or titi)

Best regards
David
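
One way the request above might be expressed, sketched in Python under a few assumptions: the field is called "text" as in the samples in this thread, "and" inside a proximity operand is approximated by a span_near with a very large slop (the same trick those samples use), "or" becomes span_or, and within/N maps to slop N-1 (matching w/3 -> 2 and w/5 -> 4 in the samples). The helper names are made up for illustration.

```python
FIELD = "text"        # assumed field name, as in the samples in this thread
AND_SLOP = int(1e6)   # "and" approximated by an effectively unlimited slop


def span_term(word):
    return {"span_term": {FIELD: word}}


def span_or(*clauses):
    return {"span_or": {"clauses": list(clauses)}}


def span_near(slop, *clauses):
    # Unordered proximity: the clauses may appear in any order.
    return {"span_near": {"clauses": list(clauses),
                          "slop": slop,
                          "in_order": False}}


# (toto and tata) within/20 (tutu or titi)
query = span_near(
    20 - 1,
    span_near(AND_SLOP, span_term("toto"), span_term("tata")),
    span_or(span_term("tutu"), span_term("titi")),
)
```

Note that the inner span covering "toto" and "tata" can be arbitrarily long, and the outer distance is measured from the edges of that span, so whether this matches the intended meaning of "and" inside a proximity operand is a judgment call.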

On Tuesday, September 10, 2013 at 13:42:20 UTC+2, Michael Sander wrote:

Hi Christophe,

Yes, I got it working using span_multi. At the time I originally sent
this message, span_multi did not exist. However, it has since been added
and works great.

I am curious: how are you using Elasticsearch in the patent space?
My site also searches legal documents: www.docketalarm.com/search

Best,

Michael Sander
michael...@gmail.com
607-227-9859

On Tue, Sep 10, 2013 at 5:20 AM, Christophe V. <christop...@cewo.fr> wrote:

Hi Michael,

I have the same need today, for searching patent documents.
Did you succeed in using the span_multi query on Elasticsearch, or
did you migrate to another Lucene-based search engine?

thanks in advance for your feedback

regards

--
Christophe

On Friday, November 9, 2012 at 23:46:35 UTC+1, Michael Sander wrote:

Hi Simon,

Yes I really want to do this and your guess is correct: I am working
on a legal research tool. Lawyers use surprisingly sophisticated queries
to research law. For example, a lawyer researching employment
discrimination lawsuits in New York may use the following query:

(employ* within/5 discrimi*) within/20 (black or latino or hispanic
or (african within/3 american)) and "New York"

It seems complex, but searches like this occur all the time and such
functionality is expected. It's one of the reasons Google Scholar is not
terribly popular with attorneys. Speed is important but not critical:
a two- or three-second wait is not a deal breaker, but it
definitely needs to be under ten seconds. To make things run faster, I could
require wildcard queries to have at least four or five letters.

I will look into creating the plugin, however it does not look like
a simple task.
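
For illustration, here is a rough Python sketch of how the employment query above might look with span_multi, which was added to Elasticsearch after this message was written. Everything here is an assumption rather than a tested translation: the field name "text", the within/N -> slop N-1 mapping, reading the trailing `and "New York"` as a separate phrase clause combined through a bool query, and the hypothetical MIN_PREFIX_LEN guard standing in for the "at least four or five letters" idea.

```python
FIELD = "text"        # assumed field name
MIN_PREFIX_LEN = 4    # hypothetical limit on wildcard prefixes


def term(word):
    """Build a span clause: span_multi/prefix for 'abc*', span_term otherwise."""
    if word.endswith("*"):
        prefix = word[:-1]
        if len(prefix) < MIN_PREFIX_LEN:
            raise ValueError("wildcard prefix too short: %r" % word)
        return {"span_multi": {"match": {"prefix": {FIELD: prefix}}}}
    return {"span_term": {FIELD: word}}


def within(n, *clauses):
    # Unordered proximity; within/N is assumed to mean slop N-1.
    return {"span_near": {"clauses": list(clauses),
                          "slop": n - 1,
                          "in_order": False}}


def any_of(*clauses):
    return {"span_or": {"clauses": list(clauses)}}


spans = within(
    20,
    within(5, term("employ*"), term("discrimi*")),
    any_of(term("black"), term("latino"), term("hispanic"),
           within(3, term("african"), term("american"))),
)

# The phrase is kept outside the span tree and simply required alongside it.
query = {"bool": {"must": [spans, {"match_phrase": {FIELD: "New York"}}]}}
```

Span terms are not analyzed, so the words passed to term() would need to match the indexed tokens (typically lowercase), while the match_phrase clause goes through the field's analyzer.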

