Stop words and Keyword tokenizer


(germap) #1

Hi all,

I'm looking for a way to remove stop words from tokens returned by a
keyword tokenizer, i.e., I'd like to obtain the original text without stop
words after the analysis process.

Sample data looks like:
"El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"

After the lowercase token filter:
"el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"

After the ascii folding token filter:
"el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"

After removing stop words:
"corregimiento mulalo, municipio yumbo (valle cauca)"

The stop words (currently) are: ["la", "el", "de", "del", "los",
"las", "jurisdiccion"]

Is the pattern replace token filter the only (or best) way to go for such a
task?

I'd much rather specify a stop words list than write custom regular
expressions; I know a stop words list works perfectly fine with other
tokenizers.
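
If a pattern replace filter does turn out to be the way to go, the regular expression does not have to be written by hand. A minimal sketch in Python, assuming the stop words list above: it builds a single pattern from the list and applies it to the folded sample text. The helper names (`fold`, `stop_word_pattern`, `strip_stop_words`) are made up for illustration, and Python's `re` stands in for the Java regex engine that a `pattern_replace` filter would use, so the generated pattern should be checked there before relying on it.

```python
import re
import unicodedata

STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def fold(text):
    """Lowercase and strip diacritics, roughly what lowercase + ascii folding do."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stop_word_pattern(words):
    """Build one regex from the list, longest words first so 'del' is tried before 'de'."""
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    # \b keeps 'la' from matching inside 'mulalo'; \s* swallows the space after the word.
    return r"\b(?:" + alternatives + r")\b\s*"


def strip_stop_words(text, words=STOP_WORDS):
    """Remove whole-word stop words and tidy up leftover whitespace."""
    cleaned = re.sub(stop_word_pattern(words), "", fold(text))
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    sample = "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
    print(stop_word_pattern(STOP_WORDS))
    print(strip_stop_words(sample))
```

The printed pattern is the piece that could be pasted into the filter's `pattern` setting, with an empty `replacement`; the list itself stays the single thing to maintain.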

Regards,

Germán

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Itamar Syn-Hershko) #2

What would be the use case for such a process (removing stop words without
tokenization)?

This may be a good read, btw:
http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/



(Ivan Brusic) #3

Also note that the content returned will still contain the stop words. Only
the inverted index will contain the stopword-less content.

--
Ivan



(germap) #4

The use case I'm addressing right now is searching place hierarchies (that
could include place types as well). In my country, you can specify place
hierarchy in several ways. For instance:

"El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
"El corregimiento de Mulaló, en jurisdicción del municipio de Yumbo del Valle del Cauca"
"El corregimiento de Mulaló, ubicado en Yumbo, Valle del Cauca"
"El corregimiento de Mulaló, en Yumbo, Valle del Cauca"
"El corregimiento de Mulaló, en el municipio de Yumbo (Valle del Cauca)"
"El corregimiento de Mulaló - Yumbo, Valle del Cauca"
"Mulaló, Yumbo, Valle del Cauca"
"Mulaló, Municipio de Yumbo, en el Valle del Cauca"
"Corregimiento de Mulaló, Municipio de Yumbo, Departamento del Valle del Cauca"
"Corregimiento de Mulaló, Municipio de Yumbo, Departamento de Valle del Cauca"
"Corregimiento de Mulaló, Municipio de Yumbo, en el Valle del Cauca"
...

All of those are equivalent.

I want to get rid of articles ("el", "la", "los", "las"), prepositions
("de", "del"), and other filler expressions (e.g. "en", "jurisdicción",
"ubicado en") so that I can compare analyzed queries with a few
pre-generated cases I can build from my original JSON docs.
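
The comparison described here can be sketched outside Elasticsearch. A rough Python illustration, assuming the filler words named above plus "ubicado"; the `normalize` helper and the `FILLER` set are hypothetical, not part of any analyzer. It folds each variant, drops punctuation and filler words, and compares the resulting keys.

```python
import re
import unicodedata

# Articles, prepositions and filler expressions mentioned in the post (folded form).
FILLER = {"el", "la", "los", "las", "de", "del", "en", "jurisdiccion", "ubicado"}


def normalize(text):
    """Fold case and accents, drop punctuation and filler words, return a comparison key."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", folded)
    return " ".join(word for word in words if word not in FILLER)


if __name__ == "__main__":
    variants = [
        "El corregimiento de Mulaló, ubicado en Yumbo, Valle del Cauca",
        "El corregimiento de Mulaló, en Yumbo, Valle del Cauca",
        "El corregimiento de Mulaló - Yumbo, Valle del Cauca",
        "Mulaló, Yumbo, Valle del Cauca",
    ]
    for variant in variants:
        print(normalize(variant))
```

The first three variants collapse to the same key, while the last one differs because it carries no place type. Words such as "corregimiento" or "municipio" are kept on purpose in this sketch, since they carry hierarchy information.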

Thanks for the link. The only caveat I see is (of course) figuring out the
right cutoff_frequency. Additionally, there are other very common words in my
index that I wouldn't like to overlook. For instance, a place type such as
"municipio" (municipality) is the second level in the place hierarchy, so
it could appear in any place from the third level down the hierarchy.
The sample data I mentioned above is a third-level place.
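
For reference, a sketch of what a common terms query body looks like, written as a Python dict so it can be serialized and sent to `_search`. The field name `name` and the cutoff value are placeholders, not values taken from this thread, and the `common` query is specific to the Elasticsearch versions of that era (it was deprecated and removed in later releases).

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a `common` query body; terms above the cutoff are treated as low-importance."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    # Placeholder value: it has to be tuned against the real index.
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


if __name__ == "__main__":
    body = common_terms_query("name", "municipio de yumbo")
    print(json.dumps(body, indent=2))
```

With a frequency-based cutoff, a place type like "municipio" would probably land among the high-frequency terms, which is the concern raised above.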

2014-08-28 13:55 GMT-05:00 Itamar Syn-Hershko itamar@code972.com:

http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/



(germap) #5

Thanks Ivan,

Do you mean that what I obtain from a request such as

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords' \
  -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)'

is not what will be present in the index after the analysis process? If so,
how could I check whether the stop words filter is being (or will be) applied
to a sample phrase?
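
One way to reason about what such a request returns is to simulate the chain. The Python sketch below is a toy model, not Elasticsearch code: every function name is made up for illustration. It assumes that a stop filter compares each whole token against the stop list, so with a keyword tokenizer the single token (the entire phrase) never matches a stop word and passes through unchanged.

```python
import unicodedata

STOP_WORDS = {"la", "el", "de", "del", "los", "las", "jurisdiccion"}


def keyword_tokenizer(text):
    """Emit the whole input as one token."""
    return [text]


def whitespace_tokenizer(text):
    """Emit one token per whitespace-separated chunk."""
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


def ascii_fold(tokens):
    def fold(token):
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [fold(token) for token in tokens]


def stop(tokens):
    """Drop tokens that are exactly equal to a stop word."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, tokenizer):
    return stop(ascii_fold(lowercase(tokenizer(text))))


if __name__ == "__main__":
    sample = "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
    print(analyze(sample, keyword_tokenizer))
    print(analyze(sample, whitespace_tokenizer))
```

In this model the keyword variant keeps every stop word, while the whitespace variant drops them but no longer yields a single phrase.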



(Itamar Syn-Hershko) #6

Take a look at suggesters - they are meant for that plus they are more
performant! http://www.elasticsearch.org/blog/you-complete-me/




(germap) #7

Thanks Itamar,

Actually, I'm planning the place hierarchy search to be part of a simple
Search API, rather than only front-end functionality (such as
autocompletion).

Users would seldom type all those words to search for a place, but all
the ways to express a place hierarchy that I listed earlier in this thread
could frequently be found in digital newspapers, articles, and text in
general. I'd like to support them all.

Do you think that for my Search API I could stick to making requests to the
_suggest endpoint instead of _search?

In my (short) experience with the completion suggester, I've seen a lack of
flexibility for relevance. For instance, with fuzzy enabled, I wasn't able
to give a higher score to exact matches than to fuzzy matches. I can do so
by using the _search endpoint, though.
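
The kind of _search request that ranks exact matches above fuzzy ones can be sketched as a bool query with two should clauses: a boosted plain match and an unboosted fuzzy match. This is only one possible shape, not necessarily the query used here; the field name `name` and the boost value are assumptions.

```python
import json


def exact_over_fuzzy_query(field, text, exact_boost=2.0):
    """Build a bool query where an exact match adds a boosted score on top of a fuzzy match."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact (non-fuzzy) clause, boosted so exact hits rank first.
                    {"match": {field: {"query": text, "boost": exact_boost}}},
                    # Fuzzy clause, catches misspellings with a lower score.
                    {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(exact_over_fuzzy_query("name", "mulalo"), indent=2))
```

A document that matches exactly satisfies both clauses and so collects both scores, whereas a near-miss only satisfies the fuzzy clause.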



(Itamar Syn-Hershko) #8

You could use the suggesters, sure, but it all really depends on the actual
data and how you expect your queries to work (for instance, exactly how
important the hierarchy really is).

Another option, for example, would be to use synonyms: "El corregimiento de
Mulaló" and "El corregimiento de Mulaló - Yumbo" become synonyms of "Mulaló"
(multiple tokens at the same position), etc.

You would then use a tokenizer normally (and tokenize on commas, for
example).

With that you still lose the full-text search capabilities, but in exchange
you get more precision (and more setup work on your part).
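
A rough Python illustration of that idea: split on commas, fold each segment, and map known variants to a canonical name. A plain dictionary lookup stands in for a real synonym token filter here, and the `SYNONYMS` entries are hypothetical examples built from phrases in this thread.

```python
import unicodedata

# Hypothetical synonym table: folded variant -> canonical place name.
SYNONYMS = {
    "el corregimiento de mulalo": "mulalo",
    "corregimiento de mulalo": "mulalo",
    "municipio de yumbo": "yumbo",
    "departamento del valle del cauca": "valle del cauca",
}


def fold(text):
    """Lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_segments(text):
    """Tokenize on commas and replace each segment with its canonical form when known."""
    segments = [fold(part).strip() for part in text.split(",")]
    return [SYNONYMS.get(segment, segment) for segment in segments if segment]


if __name__ == "__main__":
    print(canonical_segments("Corregimiento de Mulaló, Municipio de Yumbo, Departamento del Valle del Cauca"))
    print(canonical_segments("Mulaló, Yumbo, Valle del Cauca"))
```

Both inputs end up as the same list of hierarchy levels, which shows the precision gained; the setup cost is maintaining the synonym table for every place.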




(Ivan Brusic) #9

You mentioned in your original post "I'd like to obtain the original text
without stop words"

The stopword-less phrase will indeed be present in the index after the
analysis phase. However, when you ask for this content back as the result of
a query, the original text will be returned. What is indexed is not
necessarily what is stored/returned.
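
The distinction can be shown with a toy model in Python. This is a deliberately simplified sketch, not how Elasticsearch is implemented: `ToyIndex` and its methods are invented for illustration. Analyzed terms go into the postings used for matching, while the untouched original text is what a search hands back.

```python
STOP_WORDS = {"la", "el", "de", "del", "los", "las"}


def analyze(text):
    """Very small analyzer: lowercase, split on whitespace, drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


class ToyIndex:
    def __init__(self):
        self.sources = {}   # doc id -> original text (what gets returned)
        self.postings = {}  # term -> set of doc ids (what gets searched)

    def index(self, doc_id, text):
        self.sources[doc_id] = text
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Match against analyzed terms, return the stored originals."""
        return [self.sources[doc_id] for doc_id in sorted(self.postings.get(term, ()))]


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index(1, "Municipio de Yumbo")
    print(sorted(idx.postings))   # terms without the stop word
    print(idx.search("yumbo"))    # original text, stop word included
```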

Cheers,

Ivan



(germap) #10

Ivan, yes, I'm aware I would get a different text back; that's fine. What's
more, my docs have a "display" field that is returned to users after a search.
For the example given above, the display value would be something like:
"Mulaló, Yumbo, Valle del Cauca."

Itamar, I've actually considered several options. I think a synonym file
would be too big. I gave you 11 equivalent terms (you might've noticed I
could have gone on to around 30 equivalent forms), but I didn't mention
that place names on their own also have synonyms, alternate names,
abbreviations, and vernacular names. There could be 10k different places
(docs) in the index. :smiley: Also, covering every single case in the
synonym file seems sub-optimal. What I really intend is to normalize a
large number of ways of expressing place hierarchy into a few. Otherwise
I'd have to build very large lists for each place I add to the index, and
nothing would prevent me from missing a weird case. BTW, handling hierarchy
is a must; otherwise result disambiguation would be a nightmare for users.

Thanks for all the discussion, it's certainly valuable to read an expert's
opinion.

Back to my very first question: is the pattern replace token filter the
only way to remove stop words from tokens obtained from a keyword tokenizer?
And do such regular expressions perform poorly?
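
For reference, here is a minimal Python sketch of what a regex-based removal would have to do to the single token that the keyword tokenizer produces. It only simulates the lowercase, ASCII folding and pattern replace steps with the standard library; it is not Elasticsearch code, and the function names are made up for illustration. The regex is generated from the plain stop words list, so nothing has to be written by hand.

```python
import re
import unicodedata

# Stop words exactly as listed in the original post.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def fold(text):
    """Rough stand-in for the lowercase + ASCII folding token filters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (accents) left over after decomposition.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_stop_pattern(words):
    """Build one alternation with word boundaries, longest words first."""
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # Also swallow the whitespace that follows a removed word.
    return re.compile(r"\b(?:" + alternatives + r")\b\s*")


def remove_stop_words(text, words=STOP_WORDS):
    """Delete stop words from one large, untokenized string."""
    return build_stop_pattern(words).sub("", fold(text)).strip()


if __name__ == "__main__":
    sample = ("El corregimiento de Mulaló, jurisdicción del municipio "
              "de Yumbo (Valle del Cauca)")
    print(remove_stop_words(sample))
```

The word boundaries matter: without them "la" would also be cut out of "mulalo". Whether a pattern like this is fast enough inside Elasticsearch would need to be measured on real data.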


(Ivan Brusic) #11

Character filters are executed before the tokenizer, so only something in
that family of filters would work if you plan to continue using the keyword
tokenizer.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

The mapping char filter might be a better match if your list is not in regex
form. I use the mapping char filter to remove copyright, trademark and a
whole list of other characters from my content.
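
A hedged sketch of how a char filter could be set up for the stop-word case, using `pattern_replace` (another filter from the same family) rather than `mapping`. The analyzer and filter names (`place_keyword`, `strip_stop_words`) are hypothetical, and the settings have not been tested against a cluster. Because char filters see the raw text before lowercasing and folding, the pattern is case-insensitive and spells out the accented variant. Python's `re` stands in for the Java regex engine to preview the effect.

```python
import json
import re

# Raw text reaches the char filter before any token filter runs, so the
# pattern has to cope with capitals and accents by itself.
STOP_PATTERN = r"(?i)\b(?:jurisdicci[oó]n|del|los|las|la|el|de)\b\s*"

# Hypothetical index settings; the names are illustrative only.
SETTINGS = {
    "analysis": {
        "char_filter": {
            "strip_stop_words": {
                "type": "pattern_replace",
                "pattern": STOP_PATTERN,
                "replacement": "",
            }
        },
        "analyzer": {
            "place_keyword": {
                "type": "custom",
                "char_filter": ["strip_stop_words"],
                "tokenizer": "keyword",
                "filter": ["lowercase", "asciifolding"],
            }
        },
    }
}


def preview_char_filter(text):
    """Approximate the char filter locally with Python's regex engine."""
    return re.sub(STOP_PATTERN, "", text).strip()


if __name__ == "__main__":
    print(json.dumps(SETTINGS, indent=2, ensure_ascii=False))
    print(preview_char_filter(
        "El corregimiento de Mulaló, jurisdicción del municipio "
        "de Yumbo (Valle del Cauca)"))
```

The mapping char filter takes plain string pairs instead of a regex. As far as I know it has no notion of word boundaries, so short entries such as "de" would need care to avoid matching inside longer words.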

Cheers,

Ivan


(germap) #12

Thanks Ivan! I'll test which approach fits my needs better.


(system) #13