Search in the same phrase

Hey,

I want to offer my client the ability to search for two words in the same
phrase, in the same field.
Is there an easy way to do this in ElasticSearch?

Thanks

Loïc

--

Hello Loïc,

There are a few options for phrase searches. The Lucene syntax supports
phrases enclosed in double quotes:
http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Terms

In ElasticSearch, the Match query supports phrases:
http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html

A more powerful (and more complex) option is the family of Span queries:
http://www.elasticsearch.org/guide/reference/query-dsl/span-near-query.html
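
As a rough illustration of these three options, here is a minimal Python
sketch that builds each query body as a plain dict. The field name "body",
the search terms and the slop value are made up for the example; check the
query DSL reference for your version before relying on the exact shape.

```python
import json

FIELD = "body"  # hypothetical field name, not from the thread

# 1. Lucene query-string syntax: the phrase goes inside double quotes.
query_string_phrase = {
    "query_string": {"default_field": FIELD, "query": '"quick fox"'}
}

# 2. Match query in phrase mode; slop lets the terms sit a little apart.
match_phrase = {"match_phrase": {FIELD: {"query": "quick fox", "slop": 2}}}

# 3. span_near: one span_term clause per word. span_term is not analyzed,
#    so the terms must look like the indexed tokens (lowercased here).
span_near = {
    "span_near": {
        "clauses": [
            {"span_term": {FIELD: "quick"}},
            {"span_term": {FIELD: "fox"}},
        ],
        "slop": 2,
        "in_order": False,
    }
}

for name, query in [
    ("query_string", query_string_phrase),
    ("match_phrase", match_phrase),
    ("span_near", span_near),
]:
    # Each body would be sent as the "query" part of a search request.
    print(name, json.dumps({"query": query}))
```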

Cheers,

Ivan

--

Thanks for your answer.

I considered these options, but my case is specific. Span queries require a
slop parameter, which indicates the distance between two terms. But a
sentence could have 3 words, or a lot more, so I can't pick a fixed distance
and can't really use this query. The phrase query in ES has almost the same
limitation.
My query would be more like: Term_a and Term_b between the beginning of the
field and a "." character, or between two "." characters.

--

Storing each sentence in a different instance of the field and using a
span_near query is the only solution I can think of at the moment. Here is
an example of how it can be used:
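
The linked example is not reproduced in this thread. As a stand-in, and not
Igor's original code, here is a minimal Python sketch of the idea under a few
assumptions: the field is called "sentence", the gap is 256, each sentence is
sent as one element of an array, and the analyzer lowercases tokens. The
helper name same_sentence_query is hypothetical.

```python
import json

GAP = 256  # assumed to exceed the number of terms in the longest sentence

# Mapping in the style of the Elasticsearch version discussed here. Newer
# versions call the setting "position_increment_gap" and the type "text".
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "sentence": {"type": "string", "position_offset_gap": GAP}
            }
        }
    }
}

# One array element per sentence, i.e. one instance of the field each.
document = {
    "sentence": [
        "In a hole in the ground there lived a hobbit.",
        "Not a nasty, dirty, wet hole.",
    ]
}


def same_sentence_query(term_a, term_b, field="sentence", slop=GAP - 1):
    """Build a span_near query that cannot reach across two sentences."""
    if slop >= GAP:
        raise ValueError("slop must be smaller than the position gap")
    return {
        "query": {
            "span_near": {
                "clauses": [
                    # span_term is not analyzed, hence the lowercasing.
                    {"span_term": {field: term_a.lower()}},
                    {"span_term": {field: term_b.lower()}},
                ],
                "slop": slop,
                "in_order": False,
            }
        }
    }


print(json.dumps(mapping))
print(json.dumps(document))
print(json.dumps(same_sentence_query("hobbit", "ground")))
```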

--

Thanks a lot, Igor.
Once again, you're the man!

Just one quick question about your code: you're using position_offset_gap
in your mapping. What is it for, exactly?

--

Excellent question! The position_offset_gap parameter is actually the most
important part here. It has to be larger than the expected number of terms in
your longest sentence. It specifies the distance between the last
word of one instance of a field and the first word of the next instance of
that field. So, if one instance ends with "... lived a hobbit." and another
starts with "Not a nasty, ..." the distance between the term "hobbit" and
the term "not" will be whatever the position_offset_gap specifies. At search
time, specify a slop smaller than position_offset_gap; that way the span
will not match words that are indexed in different instances of the field.
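
To make the arithmetic concrete, here is a small Python sketch that imitates
how token positions are assigned across field instances. It is a simplified
model, not Elasticsearch code: tokens are just lowercased runs of word
characters, and the function names are hypothetical.

```python
import re


def index_positions(values, gap):
    """Assign a position to every token, adding `gap` between field instances."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the jump between two instances of the field
        for token in re.findall(r"\w+", value.lower()):
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def required_slop(positions, term_a, term_b):
    """Fewest positions lying between any occurrence of the two terms."""
    return min(
        abs(pa - pb) - 1
        for pa in positions[term_a]
        for pb in positions[term_b]
    )


sentences = [
    "In a hole in the ground there lived a hobbit.",
    "Not a nasty, dirty, wet hole.",
]
positions = index_positions(sentences, gap=256)

# Same sentence: only "a" sits between "lived" and "hobbit".
print(required_slop(positions, "lived", "hobbit"))  # 1
# Different sentences: the gap separates "hobbit" from "not".
print(required_slop(positions, "hobbit", "not"))  # 256
```

In this model, any slop below 256 can match "lived" with "hobbit" but never
"hobbit" with "not".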

--

Nice :)

If a document matches a mapping like this:

"mappings": {
  "doc": {
    "properties": {
      "paragraph": {
        "type": "multi_field",
        "fields": {
          "sentence": {
            "type": "string",
            "position_offset_gap": 256
          }
        }
      }
    }
  }
}

So, if I understand your logic, I could let my users search for two
terms in the same sentence, and I could also let them search for two terms
in the same paragraph by omitting the slop parameter in the same
query, right?

And one last dumb question, I promise: does using "position_offset_gap"
have any impact on the size of my documents in the index, or any other
impact? If I have tons of "sentence" values in my docs, will they weigh the
same as they normally would?

Thanks again, Igor. That's a nice option. It might be a good idea to add it
to the docs.

--

That mapping is not going to work because all it essentially does is rename
the field "sentence" to "paragraph.sentence". If you need two levels, you
will have to do something more complicated than this. For example, you can
make sentence a nested object
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html).
Then for paragraph search, you would use a normal boolean query wrapped in a
nested query
(http://www.elasticsearch.org/guide/reference/query-dsl/nested-query.html),
and for sentence search you can use a span query wrapped in a nested query.
Another way of doing this is to index the text twice, once divided by
sentences and once divided by paragraphs, and use a span query on one field
or the other depending on how you want search to work.

The position_offset_gap just increments the token position, so I cannot
think of any reason why it could have an adverse effect.
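
The "index the text twice" option could look roughly like the Python sketch
below. Everything specific in it is an assumption made for illustration: the
field names "sentences" and "paragraphs", the naive splitting rules (a "."
followed by whitespace ends a sentence, a blank line ends a paragraph), and
the two gap values, each of which would have to exceed the longest sentence
or paragraph in terms.

```python
import re

SENTENCE_GAP = 256    # assumed gap configured on the "sentences" field
PARAGRAPH_GAP = 4096  # assumed gap configured on the "paragraphs" field


def split_paragraphs(text):
    """Split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text):
    """Naive split: a '.' followed by whitespace ends a sentence."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def build_document(text):
    """Index the same text twice, once per granularity."""
    return {
        "sentences": split_sentences(text),
        "paragraphs": split_paragraphs(text),
    }


def span_query(field, gap, term_a, term_b):
    """span_near on one field, with slop kept below that field's gap."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: term_a.lower()}},
                    {"span_term": {field: term_b.lower()}},
                ],
                "slop": gap - 1,
                "in_order": False,
            }
        }
    }


text = (
    "There lived a hobbit. Not a nasty hole.\n\n"
    "It was a hobbit-hole. That means comfort."
)
doc = build_document(text)
same_sentence = span_query("sentences", SENTENCE_GAP, "hobbit", "lived")
same_paragraph = span_query("paragraphs", PARAGRAPH_GAP, "hobbit", "nasty")
print(doc)
```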

--