Stop Words in Quoted vs Unquoted Search


(Kenneth Loafman) #1

Hi,

We're seeing an odd problem in search that revolves around stop words.

If I use the query (no quotes):
worst way to travel
ES returns nothing.

If I use the query (with quotes):
"worst way to travel"
ES returns matches.

Why the difference? Do I need to remove stop words manually from a
non-quoted query?

...Ken


(Shay Banon) #2

That's strange. The first option simply translates to a boolean query, and
the other one translates to a phrase query. Can you gist a recreation?
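
To make the distinction concrete, here is a minimal sketch of the two request bodies. It assumes the search is sent as a query_string query, which the thread does not confirm, and the helper names unquoted_body and quoted_body are made up for illustration. Without quotes the terms are parsed one by one and combined into a boolean query; with escaped quotes the whole string becomes a single phrase query.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: print the JSON body for each form of the search.
# Assumption: the application uses a query_string query; the real request
# in the gist may differ.

# Unquoted: each term is parsed separately and combined as a boolean query.
unquoted_body() {
  printf '{"query":{"query_string":{"query":"%s"}}}\n' "$1"
}

# Quoted: the embedded (escaped) double quotes turn it into a phrase query.
quoted_body() {
  printf '{"query":{"query_string":{"query":"\\"%s\\""}}}\n' "$1"
}

# Usage against a local node (index name "test" is the one used in this thread;
# newer Elasticsearch releases also need -H 'Content-Type: application/json'):
#   curl -XGET 'localhost:9200/test/_search?pretty=true' -d "$(unquoted_body 'worst way to travel')"
#   curl -XGET 'localhost:9200/test/_search?pretty=true' -d "$(quoted_body 'worst way to travel')"
```

Comparing the responses of the two calls side by side is usually the quickest way to confirm which form is losing the hits.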

On Mon, Apr 9, 2012 at 9:24 PM, Kenneth Loafman <kenneth.loafman@gmail.com> wrote:


(Kenneth Loafman-2) #3

Here's the gist: https://gist.github.com/2359644

...Thanks,
...Ken

On Wed, Apr 11, 2012 at 6:30 AM, Shay Banon kimchy@gmail.com wrote:


(Igor Motov) #4

This recreation doesn't seem to reproduce the problem. The issue is most
likely in the mapping, which unfortunately isn't visible in your example.
This is what I get when I run it: https://gist.github.com/2360279
(I had to remove the publish_datetime sort, since that field is not
populated in your gist).

On Wednesday, April 11, 2012 10:30:30 AM UTC-4, Kenneth Loafman wrote:


(Kenneth Loafman-2) #5

It will run without the publish_datetime sort; that's part of the gist I
posted. It does NOT matter whether it's populated or not.

...Ken

On Wed, Apr 11, 2012 at 11:13 AM, Igor Motov imotov@gmail.com wrote:


(Kenneth Loafman-2) #6

Just to make sure, I updated the original gist with a test that has the
full mapping. It does not work. The output is from 0.19.1.

...Ken

On Wed, Apr 11, 2012 at 3:28 PM, Kenneth Loafman <kenneth@loafman.com> wrote:


(Kenneth Loafman-2) #7

0.18.7 gives the same results.

BTW, original is here: https://gist.github.com/2359644

...Ken

On Wed, Apr 11, 2012 at 3:55 PM, Kenneth Loafman <kenneth@loafman.com> wrote:


(Igor Motov) #8

I still cannot reproduce it. Something is missing, most likely the index
settings. Could you share your config file or add the output of the following
command to the repro?

curl -XGET "localhost:9200/test/_settings"
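
Alongside the settings, the mapping and the analyzer output are usually what decide how stop words are treated. The following is only a sketch: it assumes a node on localhost:9200 and the index name test from the command above, the function names es_url and show_diagnostics are illustrative, and the analyze call sends the text as the raw request body, which is the form older releases accepted (newer releases expect a JSON body instead).

```shell
#!/usr/bin/env bash
# Assumptions: node at localhost:9200 and index "test", as in the command above.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build a URL for an index-level endpoint, e.g. es_url test _mapping
es_url() {
  printf '%s/%s/%s\n' "$ES_HOST" "$1" "$2"
}

# Print settings, mapping, and the tokens produced for a sample text.
show_diagnostics() {
  local index="$1" text="$2"
  curl -s -XGET "$(es_url "$index" '_settings?pretty=true')"
  curl -s -XGET "$(es_url "$index" '_mapping?pretty=true')"
  # Text sent as the raw request body (older API form; an assumption for 0.19).
  curl -s -XGET "$(es_url "$index" '_analyze?analyzer=standard&pretty=true')" -d "$text"
}

# Usage: show_diagnostics test 'worst way to travel'
```

If the token list for the sample text is missing a word such as "to", a stop filter is active for that analyzer, which is worth knowing before comparing the quoted and unquoted searches.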

On Wednesday, April 11, 2012 5:13:13 PM UTC-4, Kenneth Loafman wrote:


(Kenneth Loafman-2) #9

Here it is:

~$ curl -XGET "localhost:9200/test/_settings?pretty=true"
{
  "test" : {
    "settings" : {
      "index.number_of_shards" : "2",
      "index.number_of_replicas" : "1",
      "index.version.created" : "190199"
    }
  }
}

On Wed, Apr 11, 2012 at 5:06 PM, Igor Motov imotov@gmail.com wrote:


(Kenneth Loafman-2) #10

Here's elasticsearch.yml just in case...

gateway:
  type: local

path :
  data : /mnt/search-data-dev/node0/data
  work : /mnt/search-data-dev/node0/work
  logs : /mnt/search-data-dev/node0/logs

index :
  number_of_shards : 2
  number_of_replicas : 1

bootstrap:
  mlockall: true

action:
  disable_delete_all_indexes: true

network :
  host : localhost

discovery.zen.ping.multicast:
  enabled: false

discovery.zen.ping.unicast:
  hosts: ["localhost:9300","localhost:9301"]

# Shard level query and fetch threshold logging.

#index.search.slowlog.level: TRACE
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
#index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
#index.search.slowlog.threshold.fetch.trace: 200ms

On Wed, Apr 11, 2012 at 5:12 PM, Kenneth Loafman <kenneth@loafman.com> wrote:


(Igor Motov) #11

The tricky part is this:

{"ok":true,"_shards":{"total":4,"successful":3,"failed":0}}

I don't quite understand how this is possible. If you had one node running,
you should have gotten "total":4, "successful":2. If you had two nodes
running, you should have gotten "total":4, "successful":4. Having
"total":4, "successful":3 most likely indicates that there is an issue with
allocating one of the shards on one of the nodes. Do you see any errors in
the log file on one of the nodes?
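
The expected counts follow from the index settings reported in this thread (2 shards, 1 replica): every primary has one replica copy, so there are 2 x (1 + 1) = 4 shard copies in total, and a replica is never placed on the same node as its primary. A small sketch of that arithmetic, with function names that are purely illustrative:

```shell
#!/usr/bin/env bash
# total_shard_copies PRIMARIES REPLICAS -> primaries * (1 + replicas)
total_shard_copies() {
  echo $(( $1 * (1 + $2) ))
}

# max_allocated_copies PRIMARIES REPLICAS NODES
# A primary and its replicas must live on different nodes, so each primary
# contributes at most min(1 + replicas, nodes) allocated copies.
max_allocated_copies() {
  local per_primary=$(( 1 + $2 ))
  if [ "$3" -lt "$per_primary" ]; then
    per_primary=$3
  fi
  echo $(( $1 * per_primary ))
}

total_shard_copies 2 1        # prints 4
max_allocated_copies 2 1 1    # prints 2 (one node: primaries only)
max_allocated_copies 2 1 2    # prints 4 (two nodes: primaries and replicas)
```

A value of 3 sits between the one-node and two-node cases, which is why it points at a copy that was not allocated at the moment of the request.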

On Wednesday, April 11, 2012 6:17:11 PM UTC-4, Kenneth Loafman wrote:


(Kenneth Loafman-2) #12

The test script does not wait after creating the index, but if you put in a
sleep after creating it, all 4 shards are created. The 3/4 shards thing does
not happen on other systems, but the lack of search results happens on all
3 systems.

  1. my desktop, 2 nodes on the same machine, 0.19.1,
  2. remote test machine, same config as above,
  3. production cluster, 4 machines, 1 ES node per machine, 0.18.7

All machines are running Ubuntu Lucid, 64-bit, Java 1.6.
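
Instead of a fixed sleep, a test script can ask the cluster health API to block until the index is fully allocated. This is a sketch under assumptions: the node address and the index name test are taken from this thread, and health_wait_url is a hypothetical helper, not part of the gist.

```shell
#!/usr/bin/env bash
# Assumption: node at localhost:9200, as in the curl commands in this thread.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build a cluster health URL that blocks until the index reaches the given
# status (default green) or the timeout (default 30s) expires.
health_wait_url() {
  local index="$1" status="${2:-green}" timeout="${3:-30s}"
  printf '%s/_cluster/health/%s?wait_for_status=%s&timeout=%s\n' \
    "$ES_HOST" "$index" "$status" "$timeout"
}

# Usage in a test script, in place of a sleep after index creation:
#   curl -s -XGET "$(health_wait_url test)"
```

Waiting for green means all primaries and replicas are allocated, so the 3 out of 4 result should no longer depend on timing.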

I think it's the filtered query that's causing the problem. Stripping out
the filter part makes it work correctly. The query I sent in the gist is
actually a minimized version of the query we use in production, as is the
mapping. The full query fails as well, which is what got this started.
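
For readers without access to the gist, the two shapes being compared look roughly like this. It is only a sketch: the term filter on a status field is invented for illustration and is not the filter from the gist or the production query, and the helper names are hypothetical. The filtered query wrapper is the form used by Elasticsearch releases of this era.

```shell
#!/usr/bin/env bash
# With the filter: the query_string query is wrapped in a "filtered" query.
# Arguments: search text, filter field, filter value (all illustrative).
filtered_body() {
  printf '{"query":{"filtered":{"query":{"query_string":{"query":"%s"}},"filter":{"term":{"%s":"%s"}}}}}\n' \
    "$1" "$2" "$3"
}

# Without the filter: the same query_string query on its own.
plain_body() {
  printf '{"query":{"query_string":{"query":"%s"}}}\n' "$1"
}

# Usage:
#   curl -XGET 'localhost:9200/test/_search?pretty=true' -d "$(filtered_body 'worst way to travel' status published)"
#   curl -XGET 'localhost:9200/test/_search?pretty=true' -d "$(plain_body 'worst way to travel')"
```

Running both bodies against the same index isolates whether the filter, rather than the stop words, is what removes the hits.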

...Ken

On Wed, Apr 11, 2012 at 5:30 PM, Igor Motov imotov@gmail.com wrote:


(Kenneth Loafman-2) #13

Ping. Any thoughts?

...Ken

On Wed, Apr 11, 2012 at 8:11 PM, Kenneth Loafman <kenneth@loafman.com> wrote:


(Igor Motov) #14

I ran out of ideas. I tried reproducing it on Mac and Ubuntu using your
config and 3 different versions of java without success. I get results for
both searches every time.

Can you reproduce it with a freshly downloaded version of elasticsearch?
Basically, download elasticsearch 0.19.1 from elasticsearch.org, unzip
it, don't change configuration, start two nodes and run stop-words-test.sh.
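
The steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: it assumes the archive has already been downloaded, that it unpacks into a directory named after the archive, and that bin/elasticsearch starts a node in the background when run without extra flags, as releases of this era did. The helpers run and fresh_repro are hypothetical, and the 10 second pause is an arbitrary allowance for the nodes to come up.

```shell
#!/usr/bin/env bash
# run CMD...: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# fresh_repro ARCHIVE SCRIPT
# Unpack an untouched distribution, start two nodes with the default
# configuration, then run the reproduction script.
fresh_repro() {
  local archive="$1" script="$2"
  local dir="${archive%.zip}"      # assumed name of the unpacked directory
  run unzip -q "$archive" -d "$(dirname "$archive")"
  run "$dir/bin/elasticsearch"     # first node (assumed to background itself)
  run "$dir/bin/elasticsearch"     # second node
  run sleep 10                     # arbitrary pause while the nodes start
  run sh "$script"
}

# Usage (preview the commands first, then run them for real):
#   DRY_RUN=1 fresh_repro elasticsearch-0.19.1.zip stop-words-test.sh
#   fresh_repro elasticsearch-0.19.1.zip stop-words-test.sh
```

Starting from an unmodified download rules out anything in the local elasticsearch.yml or in leftover data directories as the cause.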

On Thursday, April 12, 2012 9:04:01 PM UTC-4, Kenneth Loafman wrote:

recreation?

On Mon, Apr 9, 2012 at 9:24 PM, Kenneth Loafman <
kenneth.loafman@gmail.com> wrote:

Hi,

We're seeing an odd problem in search that revolves around
stop words.

If I use the query (no quotes):
worst way to travel
ES returns nothing.

If I use the query (with quotes):
"worst way to travel"
ES returns matches.

Why the difference? Do I need to remove stop words manually
from a non-quoted query?

...Ken


(Shay Banon) #15

The refresh result (total 4, successful 3, failed 0) can happen. Effectively,
the index gets created and shards start to be allocated. Index requests
will work once a primary shard is available. The refresh then runs before
all replica shards have been allocated. I don't think that's the
problem, but you can add a call to the health API with wait_for_status set
to green after the index gets created. But does that solve the problem?
That's what I missed here...
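
A minimal sketch of that health call, assuming the index is named "test" (as in the _settings output elsewhere in the thread) and a node listens on localhost:9200. ES_HOST, health_url, wait_then_refresh and the 30s timeout are illustrative choices, not part of the original test script:

```shell
#!/bin/sh
# Sketch: block until the index is green, then call _refresh.
# ES_HOST and both helper names are assumptions made for illustration.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the cluster-health URL for one index.
# $1 = index name, $2 = status to wait for, $3 = timeout
health_url() {
    printf 'http://%s/_cluster/health/%s?wait_for_status=%s&timeout=%s' \
        "$ES_HOST" "$1" "$2" "$3"
}

# Wait for the index to reach green, then refresh it.
wait_then_refresh() {
    curl -s -XGET "$(health_url "$1" green 30s)" &&
    curl -s -XPOST "http://$ES_HOST/$1/_refresh"
}

# Usage (needs a running cluster):
#   wait_then_refresh test
```

With this in place of a plain _refresh, the refresh should only run after all primaries and replicas are allocated, which would rule the 3-of-4 shards result in or out as a cause.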


(Kenneth Loafman-2) #16

Thanks for the help, Shay and Igor.

I tried Igor's request on my desktop and it worked. I upgraded my desktop
to 0.19.2 and it worked. I upgraded our test machine to 0.19.2, and it
failed. My desktop and the test machine have identical setups, so now I'm
even more confused. This may come down to something as trivial as a
minor version number on some support package. AArgh!


(Igor Motov) #17

So, it might be a timing or environment issue then. Here are a few things you
can try that might help us figure out what it is:

  1. repeat the second query several times on the machine where it fails to
    see if it will start working
  2. add curl -XGET 'http://127.0.0.2:9200/_cluster/health?waitForStatus=green'
    before _refresh
  3. compare the version of java and the filesystem where data is located.
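
Steps 1 and 3 could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, the query is sent as a URI search against an index named "test" on localhost:9200 (the real test script may use a different request body), and has_hits assumes the compact, non-pretty response format in which hits.total directly follows "hits":

```shell
#!/bin/sh
# Sketch of steps 1 and 3. All function names here are illustrative.
RETRY_DELAY="${RETRY_DELAY:-1}"   # seconds to sleep after a failed attempt

# Run a command up to $1 times; succeed as soon as one run succeeds.
retry() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$RETRY_DELAY"
    done
    return 1
}

# Read a compact search response on stdin; succeed if hits.total is non-zero.
has_hits() {
    grep -q '"hits":{"total":[1-9]'
}

# One attempt at the unquoted query (host and index are assumptions).
search_once() {
    curl -s 'http://localhost:9200/test/_search?q=worst+way+to+travel' | has_hits
}

# Step 3: print the facts worth comparing between machines.
env_report() {
    java -version 2>&1 | head -n 1
    df -T "$1"    # filesystem type of the data path (GNU df)
}

# Usage (needs a running node):
#   retry 5 search_once && echo "query started returning hits"
#   env_report /mnt/search-data-dev/node0/data
```

Running env_report on each machine and diffing the output gives a quick side-by-side of the java build and filesystem type.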


(Kenneth Loafman-2) #18

It turned out to be a Java version issue. My development machine
has sun-java6-jre 6.26-2lucid1, while the rest have 6.24-1build0. I managed
to update while 6.26 was still available. That version has since been deleted
due to a license conflict with Oracle, so I've got a working version on my
desktop while the rest are borked.

Igor, I tried 1 & 2 from your last message, but 3 was the real kicker.
Thanks.

...Ken


(system) #19