Analyzer, Fuzzy Query?

maik2102 · July 24, 2012, 7:42am

Hi all,

I'm wondering if it's possible for Elasticsearch to solve the following
problems:

Product name contains "Media Player", customer searches for "mediaplayer"
(0 hits) or "media player" (lots of hits)
Product name contains "dl380", customer searches for "dl 380" (0 hits) or
"dl380" (lots of hits)

As of today, the name is analyzed with the standard analyzer and queried
with a ftl query.

How should I analyze the string or query the index to get similar results
for both searches, with and without the blank?
I know about synonyms, but I hope there is a better, more general solution.

Thanks in advance
Greetings
maik

simonw_2 · July 24, 2012, 7:04pm

hey,

in your case I'd likely use a shingle filter that builds token n-grams for
you, i.e. the document "my super media player" would be filtered into "my",
"mysuper", "super", "supermedia", "media", "mediaplayer"

here is a
link: Elasticsearch Platform — Find real-time answers at scale | Elastic

set max_shingle_size = 2 & output_unigrams = true (that is actually the
default)

this would match "mediaplayer" as well as "media player", and your dl380
problem would be solved as well.
This might create a ton more tokens but should work just fine!
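
To make the idea concrete, here is a rough pure-Python sketch of word shingling. It is not Elasticsearch code. It assumes tokens come from lowercasing and splitting on whitespace, and that the shingles are joined with an empty separator; the shingle filter's default separator is a single space, which is why the second call below does not match. The names `shingles` and `matches` are made up for this illustration.

```python
def shingles(text, max_size=2, separator="", output_unigrams=True):
    """Return word shingles of `text`, in position order.

    Simplification: tokens come from lowercasing and splitting on
    whitespace, which only approximates a real analyzer.
    """
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every shingle of 2..max_size words starting at position i.
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out


def matches(indexed_text, query_text, separator=""):
    """True if the indexed and query sides share at least one term."""
    indexed = set(shingles(indexed_text, separator=separator))
    query = set(shingles(query_text, separator=separator))
    return bool(indexed & query)


print(shingles("my super media player"))
print(matches("Media Player", "mediaplayer"))                 # empty separator
print(matches("Media Player", "mediaplayer", separator=" "))  # space separator
print(matches("dl380", "dl 380"))
```

With the empty separator, both spellings end up sharing a term ("mediaplayer" or "dl380"), so the same analysis has to be applied at index time and at query time for this to work.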

simon

maik2102 · July 25, 2012, 5:53am

Hi simon,

thank you for your help.

I read the documentation of the shingle filter but couldn't find anything
saying that it will create "mediaplayer" as well as "media player".
In the example it only joins 2 words WITH the blank between them.

Did I get it right?

maik

simonw_2 · July 25, 2012, 6:45am

hey, you are right.

Elasticsearch doesn't expose all the functionality the shingle filter has,
such as specifying the token separator. I will open an issue and add the
functionality you need, so you can specify a token separator other than a
blank.
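
For illustration, this is roughly what the index settings could look like once the separator is configurable. It is a sketch that assumes an Elasticsearch version whose shingle filter accepts a `token_separator` setting. The filter name `joined_shingles` and the analyzer name `name_shingles` are hypothetical.

```python
import json

# Hypothetical names: "joined_shingles" and "name_shingles" are invented
# for this sketch and do not come from the thread.
settings = {
    "analysis": {
        "filter": {
            "joined_shingles": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": True,
                # Assumption: your version exposes this setting.
                # An empty string glues "media" + "player" into "mediaplayer".
                "token_separator": "",
            }
        },
        "analyzer": {
            "name_shingles": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "joined_shingles"],
            }
        },
    }
}

# Body for an index-creation request; sending it is left out on purpose.
print(json.dumps({"settings": settings}, indent=2))
```

The analyzer would then be assigned to the product name field in the mapping and used for both indexing and searching.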

simon

maik2102 · July 25, 2012, 6:49am

Hi Simon,

sounds good! Thank you.

Nevertheless, I tried the shingle filter, and it gives "better" results.
I turned on explanations, and when searching for "dl 380" the explanation
says it found "dl380" in one of my fields.
So I think the filter removes the whitespace after all?

For now, the results are OK for me.
But if it's not much work, more functionality wouldn't hurt, I think.

Greetings
maik

simonw_2 · July 25, 2012, 6:53am

hmm, I am not sure I understand this. Can you post the explain output?
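
For reference, a search request can ask for scoring explanations by setting `explain` to true in the request body. A minimal sketch of such a body, assuming a field called `name` (hypothetical) and a version that supports the `match` query; the helper `explain_search_body` is made up for this example:

```python
import json


def explain_search_body(field, text):
    """Build a search body that asks for a score explanation per hit."""
    return {
        "explain": True,
        "query": {"match": {field: text}},
    }


# "name" is an assumed field name; use whatever field holds the product name.
body = explain_search_body("name", "dl 380")
print(json.dumps(body, indent=2))
```

Each hit in the response then carries an explanation showing which term in which field contributed to the score.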

I opened issue #2116, "ShingleTokenFilterFactory doesn't expose all
relevant settings" (elastic/elasticsearch on GitHub), for this.

simon

maik2102 · July 30, 2012, 12:40pm

Hi Simon,

sorry for my delayed answer, I don't have much time at the moment.

We have products which have "dl 380" in their names. Before putting the
shingle filter into the analysis process, you couldn't find them with
"dl380". With the shingle filter you can find them.

I don't have the explain output at hand, but it said it found
"dl380" in the name.
So I think it already works the way I need it to.

If you're interested in further information, please let me know.

Greetings
maik
