Weird behavior of Elasticsearch 1.0.1


(Huy Phan) #1

Hi all,

I bumped into this weird behavior of Elasticsearch: https://gist.github.com/huyphan/9888959

Basically, I created a comma analyzer and set it as the default one. Then I
indexed this document:

{
"random_string" : "ABC,XYZ",
"random_number" : "123456,7890123",
"random_email" : "abc@foobar.com,abc@foobar.net"
}

When I then searched for it with the query "123456", I got no hits. However,
when I redid everything from scratch and indexed a slightly different
document (the same doc with the first field removed):

{
"random_number" : "123456,7890123",
"random_email" : "abc@foobar.com,abc@foobar.net"
}

the same query did return the document. I'm not sure what difference
between the two documents causes this behavior.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cf6e6972-ed30-4c42-adb5-b86d844a7167%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Luca Cavanna) #2

As far as I can see from your recreation, you only create the analyzer but
don't associate it with your fields by specifying mappings. Also, when you
query you don't specify the field you want to query, so you are using the
_all field, which has its own analyzer. That means that even if you had
specified the proper mappings, the query would execute against a different
field with a different analyzer.
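
A minimal sketch of what that could look like, assuming Elasticsearch 1.x mapping syntax, an analyzer registered under the name "comma", and a hypothetical type name "doc" (neither name is taken from the gist). The request pieces are built as plain Python values, so nothing here needs a running cluster:

```python
import json
from urllib.parse import urlencode

# Hypothetical mapping for a type called "doc" (the name is an assumption).
# The field names the analyzer explicitly for both indexing and searching.
mapping = {
    "doc": {
        "properties": {
            "random_number": {
                "type": "string",
                "index_analyzer": "comma",
                "search_analyzer": "comma",
            }
        }
    }
}

# Query one field by name instead of falling back to _all.
query_string = urlencode({"q": "random_number:123456"})

print(json.dumps(mapping, indent=2))
print("/_search?" + query_string)
```

The point of the sketch is only that the analyzer is tied to a named field and the query targets that same field.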

On Monday, March 31, 2014 12:12:37 PM UTC+2, Huy Phan wrote:

Hi all,

I bumped into this weird behavior of Elasticsearch:
https://gist.github.com/huyphan/9888959



(Huy Phan) #3

Hi Luca,

The setting index.analysis.analyzer.default_index is already in place, so I
don't think I need to specify mappings, since I actually want to use the
comma analyzer for all fields. And from what I understand, default_index is
also applied to the _all field.
As you can see in my gist, I also overrode the "standard" analyzer because
I suspected something was wrong with default_index.

You may ask about the default_search configuration: my query "123456" is
simple enough that I don't think the default analyzer would change it (and
yes, I verified that using the Analyze API).

Even if there's something wrong with my settings, that still doesn't
clearly explain why I got the result with the second document but not with
the first one.
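
For readers without the gist at hand, the kind of settings described here might look roughly like the sketch below. It assumes Elasticsearch 1.x settings syntax, a pattern tokenizer splitting on commas, and the name "comma"; the actual gist may differ in all of these details.

```python
import json

# Index-creation body in the Elasticsearch 1.x style. The name "comma" and
# the choice of a pattern tokenizer are assumptions, not copied from the gist.
index_settings = {
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "comma": {"type": "pattern", "pattern": ","}
                },
                "analyzer": {
                    # Only the index-time default is overridden here.
                    "default_index": {"type": "custom", "tokenizer": "comma"}
                },
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```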



(Luca Cavanna) #4

Right, I did miss a couple of things there, sorry about that. Will have
another look and get back to you then :)



(Jörg Prante) #5

This is expected behavior with the _all field.

For demonstration I extended your gist a bit: https://gist.github.com/jprante/9891706

Some hints:

  • A custom tokenizer should be used in a field that is configured in a
    mapping.

  • Always set both the search and the index analyzer for a field.

  • Avoid setting up a custom tokenizer for _all when more than one field is
    included in _all (which is the default). This gives unpredictable results
    because tokens from many fields are merged into _all. In edge cases, for
    example when a field comes first, you may be able to produce a hit, but
    this is purely accidental.

  • When searching with the q parameter, do not forget to specify the field
    name.
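
A rough simulation of that merging effect in plain Python. It assumes, as a simplification not confirmed in this thread, that _all receives the field values in document order separated by a single space, and it uses a plain split on commas as a stand-in for the comma tokenizer. The helper names are made up for illustration.

```python
def build_all_text(doc):
    # Assumption: _all sees the field values in document order,
    # joined by a single space.
    return " ".join(str(value) for value in doc.values())


def comma_tokenize(text):
    # Stand-in for a comma tokenizer: split on "," and keep everything else,
    # including spaces, inside the token.
    return [token for token in text.split(",") if token]


doc_with_string = {
    "random_string": "ABC,XYZ",
    "random_number": "123456,7890123",
    "random_email": "abc@foobar.com,abc@foobar.net",
}
doc_without_string = {
    "random_number": "123456,7890123",
    "random_email": "abc@foobar.com,abc@foobar.net",
}

tokens_first = comma_tokenize(build_all_text(doc_with_string))
tokens_second = comma_tokenize(build_all_text(doc_without_string))

print(tokens_first)   # ['ABC', 'XYZ 123456', '7890123 abc@foobar.com', 'abc@foobar.net']
print(tokens_second)  # ['123456', '7890123 abc@foobar.com', 'abc@foobar.net']
print("123456" in tokens_first, "123456" in tokens_second)  # False True
```

Under these assumptions, "123456" only survives as a token of its own when its field happens to come first, which matches the two cases from the original post.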

Jörg



(Huy Phan) #6

I hadn't realized that the _all field could be unpredictable at times.

There are reasons why we don't want to (or can't) predefine our mappings
when creating the index, which is why I used the default_index
configuration there.

What I'm doing is implementing a Google-like search with Elasticsearch,
so I don't want to specify any field when searching. I figured out that I
have to create another field to aggregate the terms myself instead of
relying on the _all field.
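
One way such an aggregate field could be built on the client before indexing is sketched below. This is only an illustration: the helper and the field name "search_terms" are hypothetical, and joining the terms with commas is an assumption chosen so that a comma tokenizer would see one term per token.

```python
def add_search_terms(doc, field="search_terms"):
    # Split every value on commas, then re-join all terms with commas so
    # that no token ends up spanning two source fields.
    terms = []
    for value in doc.values():
        terms.extend(t.strip() for t in str(value).split(",") if t.strip())
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[field] = ",".join(terms)
    return enriched


doc = {
    "random_string": "ABC,XYZ",
    "random_number": "123456,7890123",
    "random_email": "abc@foobar.com,abc@foobar.net",
}

enriched = add_search_terms(doc)
print(enriched["search_terms"])
```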

Anyway, that was a great answer, and it helped me understand my problem.

Thanks Jörg.

On Monday, 31 March 2014 21:09:06 UTC+8, Jörg Prante wrote:

This is expected behavior with _all field.

For demonstration I extended your gist a bit.

https://gist.github.com/jprante/9891706



(Jörg Prante) #7

Just a note regarding Google-like search: for this purpose there is the
"simple query string" query

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html

which makes life easier than query_string (which is what the 'q'
parameter uses) because it does not bail out with syntax errors.
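
A minimal sketch of a request body using that query, built as a Python dict. The field name is taken from the example document earlier in the thread; whether simple_query_string is available depends on the Elasticsearch version in use, so treat this as an illustration rather than something verified against 1.0.1.

```python
import json

# Search body using simple_query_string against one named field.
search_body = {
    "query": {
        "simple_query_string": {
            "query": "123456",
            "fields": ["random_number"],
        }
    }
}

print(json.dumps(search_body, indent=2))
```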

Jörg

On Mon, Mar 31, 2014 at 4:21 PM, Huy Phan dachuy@gmail.com wrote:

What I'm doing is implementing a Google-like search with Elasticsearch,
so I don't want to specify any field when searching.


