Beginner's question about searching


(Enrique Medina Montenegro) #1

Hi,

I know this may sound like a beginner's question about searching in ES, but I
just want to make sure I get it.

Which type of query should I use to search for matching terms? For instance,
I have indexed lots of products, and I want to search them by word. The
options I see:

a) Term --> But in the doc it says that this one is used for "not analyzed"
fields.
b) Wildcard --> Seems to work fine, but for some reason it doesn't return
the exact matches. For example, if you're looking for "iphone" and then you
define the wildcard value as "iphone*", then it won't return any product
with "iphone" whatsoever.
c) Fuzzy --> This seems to be the closest to a typical search by words. So
if I search with "iphone", then it will also find "phone", "telephone", even
"microphone".

So which one should be used in each case? Could someone post some simple
usage scenarios for each type of query?

Thanks.


(Enrique Medina Montenegro) #2

Any feedback on my question above?



(Shay Banon) #3

The simplest is to use the field query, which also analyzes the query string provided (breaks it into terms).
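
To illustrate what "analyzes the query string" means, here is a minimal Python sketch that is not taken from the thread. The `toy_analyze` function is a simplified stand-in for a real analyzer (lowercasing and accent folding only, no stop words or stemming), and the sample documents and helper names are made up for illustration. It shows why a lookup that analyzes the query finds a document, while a raw term lookup with different casing does not.

```python
import re
import unicodedata


def toy_analyze(text):
    """Hypothetical, simplified analyzer: lowercase, fold accents, split into word tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", folded)


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in toy_analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def analyzed_search(index, query):
    """Analyze the query first, then return documents matching any resulting token."""
    hits = set()
    for token in toy_analyze(query):
        hits |= index.get(token, set())
    return hits


def term_search(index, term):
    """Look the term up exactly as given, with no analysis."""
    return index.get(term, set())


# Made-up sample documents.
DOCS = {
    1: "Apple iPhone 4 16GB",
    2: "Crema para maquillarse de uso casero",
}
INDEX = build_index(DOCS)

print(analyzed_search(INDEX, "IPHONE"))  # query is lowercased before the lookup
print(term_search(INDEX, "iPhone"))      # raw term differs from the indexed token "iphone"
print(term_search(INDEX, "iphone"))      # raw term equals the indexed token
```

The same reasoning applies to the Term query from the original question: it only matches when the term is already in the exact form stored in the index.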


(Enrique Medina Montenegro) #4

But Shay, the results I get with the Fuzzy query are quite different from
those I get with the Field query.

My data is in Spanish. With the Field query, the term provided by the user
must be very close to the original term in the indexed document. For example:

"Crema para maquillarse de uso casero"

If I use the Field query with term = "maquillaje", I don't get the result
above ("maquillaje" is a noun, whereas "maquillarse" is the corresponding
verb).

But if I use the Fuzzy query with the same term, I do get it (and probably
some noise, but with lower scores).

What do you think? I'd just like to set my users' expectations here
beforehand.



(Shay Banon) #5

This depends on your analyzer, which is used to break the provided text into tokens when you use field query.


(Enrique Medina Montenegro) #6

Shay,

I'm using the same analyzer that I was using before with just Lucene (with
Compass), which is:

@SuppressWarnings({ "unchecked", "rawtypes" })
public TokenStream tokenStream(String fieldName, Reader reader) {
    // Split the input into word tokens
    TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new ASCIIFoldingFilter(result); // My change: fold accented characters to ASCII
    // Drop Spanish stop words
    result = new StopFilter(false, result,
            new HashSet(Arrays.asList(SPANISH_STOP_WORDS)));
    // Reduce each token to its Spanish stem
    result = new SnowballFilter(result, "Spanish");
    return result;
}

A quick comparison showed that searches in the "old" Lucene version did not
return the same results as ES with the same analyzer. That's why I started
changing the query type, and found Fuzzy to be the closest to the "old"
behaviour. But since Fuzzy should only be used in specific scenarios (due to
the fuzziness of the search itself), I wanted to double-check with the list
here...



(Shay Banon) #7

Have you made sure to create a new index after defining the analyzer, and to reindex the data so that it is analyzed with it?


(Enrique Medina Montenegro) #8

Actually, I'm creating a new index and analyzing everything from scratch, so
I'm sure that everything in ES is being analyzed with that Spanish
analyzer.



(Enrique Medina Montenegro) #9

Shay,

I found a very clear example of the difference between using the Fuzzy query
and the Field query. Let me explain.

I'm using a catalog of products from which I take just the name and
description. I'm also using the Spanish analyzer that I pasted in my
previous email as the default analyzer (with the configuration you already
confirmed in another thread on this list). So far so good.

Now I index some thousands of products, some of which have the term
"calibration" as part of the name. The Spanish translation of this term is
"calibración" (so pretty similar).

After having indexed everything, I start my search (using REST API):

a) If I use the Field query, then when I search for exact match
"calibración", I don't get any results:

Filter query:

{
  "from" : 0, "size" : 1000,
  "query" : {
    "field" : {
      "nombre" : {
        "query" : "calibración"
      }
    }
  }
}

Result JSON:

[{
  "hits": {
    "hits": [],
    "max_score": null,
    "total": 0
  },
  "timed_out": false,
  "took": 1,
  "_shards": {
    "failed": 0,
    "total": 3,
    "successful": 3
  }
}]

Even if I add a '*' wildcard to the end, or to both the end and the
beginning of the term, like "calibración*" or "*calibración*", I still don't
get any results.

b) With the same Field query, now I just provide the root of the term plus a
'*' wildcard, i.e. "calibra*":

Filter query:

{
  "from" : 0, "size" : 1000,
  "query" : {
    "field" : {
      "nombre" : {
        "query" : "calibra*"
      }
    }
  }
}

Result JSON:

[{
  "hits": {
    "hits": [
      {
        "_source": {
          "descripcion": "Monitor 730 Lacie de 30\" para Mac & PC",
          "nombre": "LaCie 730 LCD Monitor + Hood + Calibration Software"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "dXD8xKUFQKCv7jynZ7Z9EQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
      {
        "_source": {
          "descripcion": "Monitor 724 Lacie de 24\" para Mac & PC",
          "nombre": "LaCie 724 LCD Monitor + Hood + Calibration Software"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "3P6fLeuNTTusyj-Y99uAuQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
      {
        "_source": {
          "descripcion": "Monitor 730 Lacie de 30\" para Mac & PC",
          "nombre": "LaCie 730 LCD Monitor + Hood + Calibration Software + Colorimeter"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "DH-JDVP1RlCvJ82KF7sHcQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
.....

As you can see, I do get the expected results, but I had to trim the search
term to the stem that matches the beginning of "calibration", which is
"calibra" in this case. If I add a single character, as in "calibrac*", the
search returns no results again.
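
One possible explanation, sketched in Python under an assumption the thread does not confirm: if the token stored in the index is the English word "calibration" (the real output of the Spanish analyzer was not shown), then a trailing wildcard only matches while the prefix is still a prefix of that English word. The term list and the `matching_terms` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed tokens indexed for one product name; the actual analyzer
# output was not shown in the thread.
INDEXED_TERMS = ["lacie", "730", "lcd", "monitor", "hood", "calibration", "software"]


def matching_terms(terms, pattern):
    """Return the indexed terms that match a '*' wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# "calibra" is a prefix of "calibration"; "calibrac" and "calibracion" are not.
for pattern in ("calibra*", "calibrac*", "calibracion*"):
    print(pattern, matching_terms(INDEXED_TERMS, pattern))
```

Under that assumption, the wildcard results say more about the product names being in English than about the query type itself.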

c) Now I switch to Fuzzy query, and look for exact word "calibración":

Filter query:

{
  "from" : 0, "size" : 1000,
  "query" : {
    "fuzzy" : { "nombre" : "calibración" }
  }
}

Result JSON:

[{
  "hits": {
    "hits": [
      {
        "_source": {
          "descripcion": "Monitor 730 Lacie de 30\" para Mac & PC",
          "nombre": "LaCie 730 LCD Monitor + Hood + Calibration Software"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "dXD8xKUFQKCv7jynZ7Z9EQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
      {
        "_source": {
          "descripcion": "Monitor 724 Lacie de 24\" para Mac & PC",
          "nombre": "LaCie 724 LCD Monitor + Hood + Calibration Software"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "3P6fLeuNTTusyj-Y99uAuQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
      {
        "_source": {
          "descripcion": "Monitor 730 Lacie de 30\" para Mac & PC",
          "nombre": "LaCie 730 LCD Monitor + Hood + Calibration Software + Colorimeter"
        },
        "_type": "producto",
        "_score": 1,
        "_id": "DH-JDVP1RlCvJ82KF7sHcQ",
        "_version": 1,
        "_index": "cuestamenos"
      },
.....

As you can see, the Fuzzy query does return the expected results: a Spanish
user might not know the English word for "calibración", but still wants (or
at least expects) to see those results. And I don't have to stem the search
term; I just pass it as the user enters it. Even a search term like
"calibrado" (which stands for "calibrated" in English) works as expected.

So why would we use the Field query, which only seems to return matches when
the stem of the search word is used in conjunction with a '*' wildcard, rather
than the Fuzzy query, in general? Any feedback on this proof of concept?
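
A rough way to see why the Fuzzy query matches here is to compute the edit distance between the Spanish search terms and the English word, again assuming the indexed token is "calibration". The sketch below is plain Python, not ES or Lucene code. The `similarity` formula and the 0.5 threshold mentioned in the comments are assumptions about older Lucene defaults, not something stated in the thread.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed approximation of a fuzzy similarity score: 1 - distance / shorter length."""
    return 1.0 - edit_distance(a, b) / min(len(a), len(b))


INDEXED = "calibration"  # assumed indexed token
for query in ("calibraci\u00f3n", "calibrado"):
    # Both scores land above 0.5, the minimum similarity assumed here.
    print(query, edit_distance(query, INDEXED), round(similarity(query, INDEXED), 2))
```

So the Fuzzy query succeeds because the Spanish and English words happen to be spelled almost alike, which will not hold for most translated terms.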

Thanks!


