Stop words and Keyword tokenizer

germap · August 29, 2014, 2:49pm

Thanks Ivan! I'll test which way fits better to my needs.

2014-08-28 17:12 GMT-05:00 Ivan Brusic ivan@brusic.com:

Character filters are executed before the tokenizer, so only something in
that family of filters would work if you plan to continue using the keyword
tokenizer.

The mapping char filter might be a better match if your list is not in
regex form. I use the mapping char filter to remove copyright, trademark
and a whole list of other characters from my content.
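
As a rough illustration of the char filter route, here is a minimal Python sketch of index settings that put a mapping char filter in front of the keyword tokenizer, plus a local emulation of what the mapping step does. The names `strip_symbols` and `place_keyword` are made up for this example, the emulation uses plain string replacement rather than Lucene's actual matching, and it assumes an empty right-hand side in a mapping rule is accepted as "delete this character".

```python
import json

# Hypothetical rules: delete copyright, registered and trademark signs.
# Assumption: an empty right-hand side means "replace with nothing".
MAPPINGS = ["©=>", "®=>", "™=>"]

# Hypothetical analysis settings; "strip_symbols" and "place_keyword"
# are illustrative names, not something from the thread.
settings = {
    "analysis": {
        "char_filter": {
            "strip_symbols": {"type": "mapping", "mappings": MAPPINGS}
        },
        "analyzer": {
            "place_keyword": {
                "type": "custom",
                "char_filter": ["strip_symbols"],  # runs before the tokenizer
                "tokenizer": "keyword",            # whole input stays one token
                "filter": ["lowercase", "asciifolding"],
            }
        },
    }
}


def apply_mappings(text, mappings):
    """Simplified local emulation: apply each 'source=>target' rule in order."""
    for rule in mappings:
        source, _, target = rule.partition("=>")
        text = text.replace(source.strip(), target.strip())
    return text


print(json.dumps(settings, indent=2, ensure_ascii=False))
print(apply_mappings("Acme™ Maps © 2014", MAPPINGS))
```

Note that deleting a character leaves the surrounding whitespace in place, so a later whitespace cleanup may still be needed.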

Cheers,

Ivan

On Thu, Aug 28, 2014 at 2:33 PM, Germán Carrillo <
carrillo.german@gmail.com> wrote:

Ivan, yes, I'm aware I would obtain another text, that's fine. Even more,
my docs have a "display" field to be returned to users after a search. For
the example given above, the display value would be something like:
"Mulaló, Yumbo, Valle del Cauca."

Itamar, I've actually considered several options. I think a synonym file
would be too big. I gave you 11 equivalent terms (you might've noticed I
could have continued to give you around 30 equivalent ways), but I didn't
mention place names (alone) have their corresponding synonyms, alternate
names, abbreviations, and vernacular names. There could be 10k different
places (docs) in the index. Also, accounting for every single case in the
synonym file seems sub-optimal. Really, I intend to
normalize a large number of ways of expressing place hierarchy into a few
ways. Otherwise I'd have to build very large lists for each place I add to
the index, and nothing guarantees I won't miss a weird case. BTW, handling
hierarchy is a must, otherwise result disambiguation would be a nightmare
for users.

Thanks for all the discussion, it's certainly valuable to read an
expert's opinion.

Back to my very first question, is the pattern replace token filter the
only way to remove stop words from tokens obtained from a keyword tokenizer?
Are those regular expressions not very performant?

2014-08-28 15:49 GMT-05:00 Ivan Brusic ivan@brusic.com:

You mentioned in your original post "I'd like to obtain the original
text without stop words"

The stopword-less phrase will indeed be present in the index after the
analysis phase. However, when you ask for this content back as a result of
a query, the original text will be returned. What is indexed is not
necessarily what is stored/returned.
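
To make the indexed-versus-stored distinction concrete, here is a toy Python sketch. It is not how Elasticsearch is implemented; `analyze`, `stored` and `inverted_index` are hypothetical stand-ins, and the analyzer is reduced to lowercasing plus dropping a few stop words while keeping the input as a single keyword-style token.

```python
STOP_WORDS = {"el", "de", "del"}

stored = {}          # doc id -> original text, returned to the user
inverted_index = {}  # analyzed token -> set of doc ids, used for matching


def analyze(text):
    """Stand-in analyzer: lowercase, drop stop words, emit ONE token."""
    kept = [word for word in text.lower().split() if word not in STOP_WORDS]
    return [" ".join(kept)]


def index_document(doc_id, text):
    stored[doc_id] = text                      # the original is kept untouched
    for token in analyze(text):
        inverted_index.setdefault(token, set()).add(doc_id)


def search(query):
    hits = set()
    for token in analyze(query):               # queries are analyzed too
        hits |= inverted_index.get(token, set())
    return [stored[doc_id] for doc_id in sorted(hits)]


index_document(1, "Valle del Cauca")
print(list(inverted_index))   # the analyzed form is what gets matched
print(search("valle cauca"))  # the stored original is what comes back
```

The search matches on the stop-word-free token, yet the text handed back still contains "del".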

Cheers,

Ivan

On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <
carrillo.german@gmail.com> wrote:

Thanks Ivan,

do you mean what I obtain from a request such as

curl -XGET \
  'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords' \
  -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)'

is not what will be present in the index after the analysis process? If
so, how could I check whether the stop words filter is being (will be)
applied to a sample phrase?

2014-08-28 14:03 GMT-05:00 Ivan Brusic ivan@brusic.com:

Also note that the content returned will still contain the stop
words. Only the inverted index will contain the stopword-less content.

--
Ivan

On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <
itamar@code972.com> wrote:

What would be the use case for such a process (removing stop words
without tokenization)?

This may be a good read btw:
Elasticsearch Platform — Find real-time answers at scale | Elastic

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
carrillo.german@gmail.com> wrote:

Hi all,

I'm looking for a way to remove stop words from tokens returned by a
keyword tokenizer, i.e., I'd like to obtain the original text without stop
words after the analysis process.

Sample data looks like: "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
After the lowercase token filter: "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
After the ascii folding token filter: "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
After removing stop words: "corregimiento mulalo, municipio yumbo (valle cauca)"

The stop words (currently) are: ["la", "el", "de", "del",
"los", "las", "jurisdiccion"]

Is the pattern replace token filter the only (or best) way to go for
such a task?
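
For reference, the intended pipeline can be emulated locally in Python, with the regular expression generated from the stop words list instead of written by hand. This is only a sketch of the desired behaviour, not Elasticsearch code: `ascii_fold` is a rough approximation of the ascii folding filter (it strips combining accents after Unicode decomposition), and the helper names are hypothetical.

```python
import re
import unicodedata

STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def ascii_fold(text):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_stop_pattern(words):
    """Build one regex from the list: whole words only, plus trailing spaces."""
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b\s*")


def normalize(text):
    """Lowercase, fold accents, remove stop words, keep the text as one string."""
    folded = ascii_fold(text.lower())
    stripped = build_stop_pattern(STOP_WORDS).sub("", folded)
    return re.sub(r"\s+", " ", stripped).strip()


sample = ("El corregimiento de Mulaló, jurisdicción del municipio "
          "de Yumbo (Valle del Cauca)")
print(normalize(sample))
# corregimiento mulalo, municipio yumbo (valle cauca)
```

The word boundaries (`\b`) keep the pattern from eating "la" inside "mulalo". The stop words are matched after lowercasing and folding here; a pattern applied before those steps would also have to cover "El" and "jurisdicción".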

I'd much rather specify a stop words list, which I know works perfectly
well with other tokenizers, than write custom regular expressions.

Regards,

Germán

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.