Problem when using analyzers (very small data set)

Hi,

I've been trying to use some new analyzers in my ES instance (version
0.20.4) and I've noticed some problems with search. I'm following this
simple example:
http://mnylen.tumblr.com/post/22963609412/elasticsearch-and-a-simple-contains-search

Here's how I set up my data:

curl -XPUT "http://localhost:9200/catalog/?pretty" -d '{
  "mappings" : {
    "product" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "search_analyzer" : "str_search_analyzer",
          "index_analyzer" : "str_index_analyzer"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "str_search_analyzer" : {
          "tokenizer" : "whitespace",
          "filter" : ["lowercase"]
        },
        "str_index_analyzer" : {
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "substring"]
        }
      },
      "filter" : {
        "substring" : {
          "type" : "nGram",
          "min_gram" : 6,
          "max_gram" : 20
        }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/catalog/product?pretty" -d '{
  "title" : "Logitech Wireless Keyboard K350"
}'

curl -XPOST "http://localhost:9200/catalog/product?pretty" -d '{
  "title" : "Das Keyboard"
}'

Compared to the example, I changed "tokenizer" : "keyword" to
"tokenizer" : "whitespace", and "min_gram" from 1 to 6.
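As a side note of mine (not part of the original example): a quick way to see what actually gets indexed is to point the _analyze API at the custom analyzer, with the sample text in the request body:

```shell
# Show the tokens str_index_analyzer produces for one title word
# (assumes the catalog index created above exists on localhost:9200)
curl "http://localhost:9200/catalog/_analyze?pretty&analyzer=str_index_analyzer" -d 'Logitech'
```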

The first problem is that unless I specify a field, searching for these
tokens doesn't work.

The tokens for the word "Logitech" are "gitech", "logite", "logitec",
"logitech", "ogitec" and "ogitech". So I would expect a search for any of
those substrings to return the Logitech result; however, this returns no
results:

http://localhost:9200/catalog/_search?q=ogitec
{"took" : 9, "timed_out" : false, "_shards" : {"total" : 5, "successful" : 5, "failed" : 0}, "hits" : {"total" : 0, "max_score" : null, "hits" : []}}
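As an aside, the six substrings listed above can be double-checked without ES at all. This is a plain-bash sketch of my own that mimics the nGram filter's min_gram/max_gram behaviour on the lowercased token:

```shell
#!/usr/bin/env bash
# Emit every substring of "logitech" with length between 6 and 20,
# mimicking the nGram filter's min_gram/max_gram settings above.
word="logitech"
len=${#word}
grams=()
for (( n = 6; n <= 20 && n <= len; n++ )); do
  for (( i = 0; i + n <= len; i++ )); do
    grams+=( "${word:i:n}" )
  done
done
printf '%s\n' "${grams[@]}"
```

which prints the same six substrings, grouped by length: logite, ogitec, gitech, logitec, ogitech, logitech.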

If I do the same search and specify title, only then do I get the expected
result:

http://localhost:9200/catalog/_search?q=title:ogitec
{"took" : 4, "timed_out" : false, "_shards" : {"total" : 5, "successful" : 5, "failed" : 0}, "hits" : {"total" : 1, "max_score" : 0.067124054, "hits" : [{"_index" : "catalog", "_type" : "product", "_id" : "J4trtjx9Rwm5jdI0kqmygA", "_score" : 0.067124054, "_source" : {"title" : "Logitech Wireless Keyboard K350"}}]}}

Does anyone know why this is happening?

The second thing I noticed (using the same data setup) was that when I
analyze the strings I'm searching for, I get two different token types.
Using the default analyzer on the catalog index shows the type as
"<ALPHANUM>":

http://localhost:9200/catalog/_analyze?pretty&text=ogitec
{"tokens" : [{"token" : "ogitec", "start_offset" : 0, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 1}]}
But using my custom analyzer shows the type as "word":

http://localhost:9200/catalog/_analyze?pretty&text=ogitec&analyzer=str_search_analyzer
{"tokens" : [{"token" : "ogitec", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 1}]}

I'm wondering if this has something to do with the problem above
(searching over all fields in an index as opposed to a specified one), and
why it is happening.

Any help is appreciated.
(Also, sorry for the length of the post. I didn't want to leave out any
information)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You're only applying this custom analysis to the title field in your
mapping properties. If you don't specify that field in your searches, then
a default field is used (_all in this case).
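For what it's worth, here is the field-qualified search written out in the query DSL instead of the q= shortcut (my own sketch, assuming the catalog index from the post; the match query is available in 0.20):

```shell
# Query the analyzed title field explicitly; equivalent to
# the URI search ?q=title:ogitec from the post.
curl -XPOST "http://localhost:9200/catalog/_search?pretty" -d '{
  "query" : { "match" : { "title" : "ogitec" } }
}'
```

I believe the _all field can also be given its own analyzers in the mapping, which would make unqualified searches go through the same analysis, but I haven't verified that on 0.20.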

-Eric

On Monday, March 4, 2013 10:42:31 AM UTC-5, Paul W wrote:


Oh ok, that makes sense. Cool, thanks.

Also, do you have any idea why the type shows as "<ALPHANUM>" with the
standard analyzer but "word" when I use my custom analyzer? Do you think
this will make a difference when searching?

On Monday, 4 March 2013 17:39:05 UTC, egaumer wrote:
