Match text ending in a word


(Mike) #1

Is there a way to search for text that ends with a word?
For example, if I wanted to match text that ended in "control", "pollution
control" would match but "control price" would not.

--


(Mike) #2

I imagine the only way to do this would be to use a regex query, which
isn't exposed in Elastic at the moment due to its poor performance?

--


(Chris Male) #3

Mike,

I can think of a couple of solutions. One way would be to use some custom
analysis to mark the last word in your text in some way, either by adding a
payload or just by adding a prefix/suffix like _last. You could then rewrite
your queries to make use of this mark.
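
A minimal sketch of the suffix-marking idea, assuming plain lowercase whitespace tokenization in place of a real custom analyzer. The helper names `mark_last_token` and `rewrite_ends_with` are hypothetical and only illustrate the index-time and query-time halves of the approach:

```python
def mark_last_token(text, suffix="_last"):
    """Lowercase and whitespace-tokenize text, tagging the final token."""
    tokens = text.lower().split()
    if tokens:
        tokens[-1] += suffix
    return tokens


def rewrite_ends_with(word, suffix="_last"):
    """Rewrite an 'ends with <word>' request into the marked term."""
    return word.lower() + suffix


# Stand-in for an index: each document maps to its marked token list.
docs = ["pollution control", "control price"]
index = {doc: mark_last_token(doc) for doc in docs}

term = rewrite_ends_with("control")
matches = [doc for doc, tokens in index.items() if term in tokens]
print(matches)
```

Only "pollution control" carries the token `control_last`, so it is the only match.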

Alternatively (and far more complex), you could create your own
SpanPositionCheckQuery implementation that checks the position of the
matches.

Good luck.

--


(simonw-2) #4

Another solution is to index your text in a second field and drop
everything but the last word. This obviously only works if you are not
trying to do this on free text but on titles or something similar, i.e.
"pollution control price" would be tricky ;)

simon
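
A rough sketch of the second-field idea, assuming the last word is extracted in application code before the document is sent to the index. The field names `title` and `last_word` and both helper functions are assumptions for illustration:

```python
def last_word(text):
    """Return the final whitespace-separated word, lowercased ('' if none)."""
    tokens = text.lower().split()
    return tokens[-1] if tokens else ""


def to_document(title):
    """Build the document to index, with the extra last-word field."""
    return {"title": title, "last_word": last_word(title)}


docs = [to_document(t) for t in ["pollution control", "control price"]]

# An exact match on the extra field answers "ends with 'control'".
hits = [d["title"] for d in docs if d["last_word"] == "control"]
print(hits)
```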

--


(phill) #5

Simon,

Why do you think indexing the last word would be tricky?
It seems to me that the last word in either the body or the title of a
doc is easy to identify.

A more general solution, which Mike didn't say he needed, is to index the
end of the text in reverse. Using Simon's example: "ecirp lortnoc
noitullop". You only have to index as much of the end as you allow the
user to search. Then you take whatever the user wants to find, reverse
it, and find it in this special field. You can do this using a prefix
query, which should be a lot faster than trying an RE on every value of a
field and a lot easier to write than a special span query or adding a
payload.
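
The reversed-field approach could be sketched roughly as follows. The `max_chars` limit, the helper names, and the field name `ends_with` are assumptions, and the dict is only the body of an Elasticsearch-style prefix query, not a complete client call:

```python
def reverse_tail(text, max_chars=50):
    """Return the last max_chars characters of text, lowercased and reversed."""
    return text.lower()[-max_chars:][::-1]


def ends_with_query(suffix, field="ends_with"):
    """Build a prefix-query body that searches the reversed field."""
    return {"query": {"prefix": {field: suffix.lower()[::-1]}}}


def matches(doc_text, suffix):
    """Local stand-in for the prefix match the search engine would perform."""
    return reverse_tail(doc_text).startswith(suffix.lower()[::-1])


print(reverse_tail("pollution control price"))
print(ends_with_query("control"))
print(matches("pollution control", "control"))
print(matches("control price", "control"))
```

Note that this matches on characters, not words, so a text ending in "decontrol" would also match a search for "control".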

Chris, how would adding a payload just to mark one word in the text be
useful? Forget about passing it down through ES to Lucene for a
moment. How could I even build a Lucene query tree that says "and term
X is exactly the last word", and would that really be as easy to
generate in code as adding an extra sub-expression to the query that
simply says "last_word":"price" (to reuse Simon's example)?

This new "ends_with" field is a case of a field that should not be part
of the "_all" field.

If it is just the last word and not the last part of the document, you
also don't need analysis, positions, etc., so the overhead for the
new field seems both reasonable and even similar to adding just enough
payload information to identify the special position of a word. An
example I can think of that might require special payloads is answering
the question: did the hit highlight come from the very end of the body
of text? That would require some rewriting, but even then I think all
the information is already available in the index; it is more a
problem of finding it.

To me this new field is also an example of indexing to support answering
(as fast as possible) a question that you know you need to answer, without
worrying about de-normalizing. A lot of the design of an index is about
pre-calculating the values to support the searches and reducing
search-time overhead.

-Paul

--


(simonw-2) #6

Hey Paul,

I might not have been explicit enough in my last post. What I meant by
tricky was this: in "pollution control price" (reusing my own example :) ),
the term "control" comes after "pollution" and before "price", so if
you want to match it there you can't just index the last term, i.e. this gets
a lot trickier. If that is not the case, the second field with only the last
term is, as you said, the "fast" and "simple" solution.

I hope that clarifies this a bit.

simon

--


(Mike) #7

Thanks a lot, guys. Paul's solution works great for me. Just a note for
anyone reading this: make sure you index the field as not_analyzed or with
the keyword analyzer, otherwise your prefix search will match the
individual tokens in the text, defeating the entire purpose.
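
To illustrate why the analysis setting matters, here is a small simulation, assuming whitespace tokenization as a stand-in for an analyzer. The mapping fragment mirrors the not_analyzed setting named above and uses a hypothetical field name `ends_with`; the exact mapping syntax depends on the Elasticsearch version:

```python
# Assumed mapping fragment for the reversed field (version-dependent syntax).
mapping = {"ends_with": {"type": "string", "index": "not_analyzed"}}


def index_terms(value, analyzed):
    """Return the terms the index would hold for a single field value."""
    return value.split() if analyzed else [value]


def prefix_hit(terms, prefix):
    """True if any indexed term starts with the given prefix."""
    return any(term.startswith(prefix) for term in terms)


reversed_value = "control price"[::-1]  # "ecirp lortnoc"
prefix = "control"[::-1]                # "lortnoc"

# Analyzed: the token "lortnoc" matches, although the text ends in "price".
print(prefix_hit(index_terms(reversed_value, analyzed=True), prefix))
# Not analyzed: the single term "ecirp lortnoc" does not start with "lortnoc".
print(prefix_hit(index_terms(reversed_value, analyzed=False), prefix))
```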
