Searching for "foo" should also find occurrences of "foo.bar"


(Marian Steinbach) #1

We have Elasticsearch 1.5 set up with a very simple mapping to perform
full-text search in our docs (https://docs.giantswarm.io/). When searching for
"swarmvars" we get no hits, although "swarmvars.json" appears in documents.

The field "text" is used as a catch-all field for all searchable content
(title, document body, keywords). Here is the mapping:

"properties": {
    ...,
    "text": {
        "type": "string",
        "store": true,
        "index": "analyzed",
        "term_vector": "with_positions_offsets",
        "analyzer": "english"
    }
}

When using the "english" analyzer on the text "Text containing
swarmvars.json and more", the resulting tokens are:

text
contain
swarmvars.json
more

Having the token "swarmvars.json" is fine. What I need are two additional
tokens "swarmvars" and "json". How can I achieve that?

I was looking into creating a custom tokenizer, but I was unable to get it
to work (errors when applying the settings) and also I was unable to find
an example, no matter how I searched.

Thanks!

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/85a03096-ae33-4517-8eab-6f2be4da73ed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(David Pilato) #2

I would probably go with a Pattern Tokenizer and define whatever regex you need.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html

The standard tokenizer is geared more toward English text, which means that a dot needs to be followed by a space in order to be treated as a break between two tokens.

Make sense?
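
To make the suggestion concrete, here is a minimal sketch of what such index settings might look like, written as a Python dict so it can be dumped to JSON. The names "non_word_splitter" and "split_on_punct" are made up for illustration, and the small helper only approximates the tokenizer locally with Python's re module (Elasticsearch uses Java regular expressions, which behave the same for plain ASCII input like this but may differ elsewhere). It has no stemming or stop-word removal, so it is not a replacement for the "english" analyzer.

```python
import json
import re

# Split on runs of non-word characters. This is also the documented default
# pattern of the pattern tokenizer; it is spelled out here for clarity.
PATTERN = r"\W+"

# Hypothetical names: "non_word_splitter" and "split_on_punct" are illustrative.
index_settings = {
    "analysis": {
        "tokenizer": {
            "non_word_splitter": {"type": "pattern", "pattern": PATTERN}
        },
        "analyzer": {
            "split_on_punct": {
                "type": "custom",
                "tokenizer": "non_word_splitter",
                "filter": ["lowercase"],
            }
        },
    }
}


def simulate_pattern_analyzer(text):
    """Approximate the analyzer locally: split on PATTERN, drop empties, lowercase."""
    return [token.lower() for token in re.split(PATTERN, text) if token]


if __name__ == "__main__":
    # Body that could be sent when creating the index.
    print(json.dumps({"settings": index_settings}, indent=2))
    print(simulate_pattern_analyzer("Text containing swarmvars.json and more"))
```

Note that splitting on the dot yields "swarmvars" and "json" but no longer the combined token "swarmvars.json", so this analyzer alone does not cover every case.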

--
David Pilato - Developer | Evangelist
elastic.co
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs

In reply to Marian Steinbach, 29 May 2015 at 09:39.



(Marian Steinbach) #3

Thanks for the reply! However, it doesn't quite make sense to me yet.

If I use the dot as an additional separator, I will end up with the tokens
"swarmvars" and "json", but not "swarmvars.json". Right?

In reply to David Pilato, Friday, 29 May 2015 at 10:47:56 UTC+2.



(David Pilato) #4

Yes. Because « Hello. How are you? » is a sentence that can be broken into « hello », « how », « are », « you ».
But in « I paid 2.50 euros for it », I would most likely keep « 2.50 » as a whole token.


In reply to Marian Steinbach, 29 May 2015 at 10:59.



(Marian Steinbach) #5

On Friday, 29 May 2015 at 11:02:25 UTC+2, David Pilato wrote:

Yes. Because « Hello. How are you? » is a sentence that can be broken into
« hello », « how », « are », « you ».
But in « I paid 2.50 euros for it », I would most likely keep « 2.50 » as a
whole token.

So far, so easy. And my question is now: From a text "foo.bar", how can I
generate ALL of the following tokens?

foo
bar
foo.bar



(David Pilato) #6

I would use two analyzers and a multi-field: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#_multi_fields_3
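
As a rough illustration of that idea, the sketch below extends the "text" mapping from the first post with a sub-field that uses a second analyzer, and queries both fields at once. The sub-field name "parts" and the analyzer name "split_on_punct" are hypothetical; the latter is assumed to be a custom analyzer defined in the index settings that splits on punctuation. The two Python helpers are crude stand-ins for the analyzers (no stemming, no stop words) and only show that, between them, the two fields would cover "foo", "bar" and "foo.bar".

```python
import re

# Hypothetical mapping: "parts" and "split_on_punct" are illustrative names.
# "split_on_punct" is assumed to be defined under settings.analysis.analyzer.
text_mapping = {
    "text": {
        "type": "string",
        "store": True,
        "index": "analyzed",
        "term_vector": "with_positions_offsets",
        "analyzer": "english",
        "fields": {
            "parts": {"type": "string", "analyzer": "split_on_punct"}
        },
    }
}

# Query the main field and the sub-field together so either token form can match.
search_body = {
    "query": {
        "multi_match": {"query": "swarmvars", "fields": ["text", "text.parts"]}
    }
}


def whole_tokens(text):
    """Crude stand-in for the main field: whitespace split, lowercase."""
    return {token.lower() for token in text.split()}


def part_tokens(text):
    """Crude stand-in for the sub-field: split on non-word characters."""
    return {token.lower() for token in re.split(r"\W+", text) if token}


def combined_tokens(text):
    """Union of the tokens the two fields would make searchable."""
    return whole_tokens(text) | part_tokens(text)


if __name__ == "__main__":
    print(sorted(combined_tokens("foo.bar")))
```

With a mapping along these lines, documents indexed before the sub-field was added would presumably need to be reindexed before "text.parts" returns hits for them.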


In reply to Marian Steinbach, 29 May 2015 at 11:11.


