ES defaults not exhibiting typical Lucene behavior


(Allison A.) #1

I have a field "CategoryMajor"

I have two documents:
Doc 1: CategoryMajor: Restaurants
Doc 2: CategoryMajor: Restaurants, Restaurants, Restaurants, Restaurants,
Restaurants

If I search for CategoryMajor:Restaurants, then Doc #1 is more relevant
than Doc #2.

Why is this, and how do I remedy this?

Thanks,
Allison

--


(Allison A.) #2

I've provided some output below. The document with four instances of the
term should be most relevant, then the one with two, and the one with a
single instance last. Yet this is not the case...

http://localhost:9200/test/type1/_search?pretty=true&q=cat:restaurants

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "test",
      "_type" : "type1",
      "_id" : "doc2",
      "_score" : 0.30685282, "_source" : {"cat": "restaurants restaurants restaurants restaurants"}
    }, {
      "_index" : "test",
      "_type" : "type1",
      "_id" : "doc3",
      "_score" : 0.30685282, "_source" : {"cat": "restaurants"}
    }, {
      "_index" : "test",
      "_type" : "type1",
      "_id" : "doc1",
      "_score" : 0.2712221, "_source" : {"cat": "restaurants restaurants"}
    } ]
  }
}

--


(phill) #3

What does your mapping look like? That can affect scoring.

http://localhost:9200/test/type1/_mapping?pretty=true

-Paul

--


(Clinton Gormley) #4

Hi Allison

On Fri, 2012-09-07 at 11:31 -0700, Allison A. wrote:

I've provided some output below. The 4 instances of the term should be
most relevant, then the two, and one last. Yet this is not the case...

You really need to provide a full recreation of the problem. That allows
us to test it locally. See http://www.elasticsearch.org/help
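
A recreation for this thread could be as small as the Python sketch below. It only reuses what was posted above: the host and port, the index "test", the type "type1", the field "cat", and the three document bodies. The helper names `build_requests` and `send` are made up for illustration, and the delete, create and refresh steps are assumptions about how the test index was set up.

```python
import json
import urllib.error
import urllib.request

BASE = "http://localhost:9200"  # host and port as used in the URLs in this thread

# Document ids and bodies copied from the search output posted above.
DOCS = {
    "doc1": "restaurants restaurants",
    "doc2": "restaurants restaurants restaurants restaurants",
    "doc3": "restaurants",
}


def build_requests(index="test", doc_type="type1"):
    """Return (method, url, body) tuples that rebuild the test case from scratch."""
    requests = [
        ("DELETE", f"{BASE}/{index}", None),  # start clean (assumed step)
        ("PUT", f"{BASE}/{index}", None),     # create the index with default settings
    ]
    for doc_id, cat in sorted(DOCS.items()):
        url = f"{BASE}/{index}/{doc_type}/{doc_id}"
        requests.append(("PUT", url, json.dumps({"cat": cat})))
    # Make the documents searchable before running the query.
    requests.append(("POST", f"{BASE}/{index}/_refresh", None))
    search = f"{BASE}/{index}/{doc_type}/_search?pretty=true&q=cat:restaurants"
    requests.append(("GET", search, None))
    return requests


def send(method, url, body=None):
    """Send one request and return the response text, including error bodies."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # For example a 404 when deleting an index that does not exist yet.
        return err.read().decode("utf-8")
```

Running `send(*r)` for each tuple returned by `build_requests()` against a local node should index the three documents and print the search response, so anyone can paste the script and see the same scores.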

However, my guess would be that it is a combination of a few factors:

  1. You have few docs, and you are using the default of 5 primary shards,
    so your terms are not yet well distributed across them. You can
    eliminate this problem by:

    • indexing more docs (i.e. a real-world test)
    • reducing your test index to a single shard
    • using search_type=dfs_query_then_fetch

  2. You are testing short bits of text. The "norm" for a field is stored
    (in Lucene < 4) in a single byte, which loses precision. So a field
    with 1 token may be considered to be the same "length" as a field
    with 2-4 tokens.
clint

--

