Query on arrays


(shane) #1

I've got some indexed documents with some data that looks like this:

    "_source":{
       "title":"The Fault in Our Stars",
       "contributors":{
          "name":"John Green"
       }
    }

Then I've got other documents with multiple contributors:

        "_source":{
           "title":"Horror for Good: A Charitable Anthology",
           "contributors":{
              "name":[
                 "Joe R Lansdale",
                 "Ray Garton",
                 "F. Paul Wilson",
                 "Ian Harding",
                 "Shaun Hutson",
                 "Jeff Strand",
                 "Jack Ketchum",
                 "Wrath James White",
                 "Monica J. O'Rourke",
                 "Lisa Morton",
                 "Laird Barron",
                 "Joe McKinney",
                 "Richard Salter",
                 "Thomas Lee",
                 "Gary McMahon",
                 "Taylor Grant",
                 "Lorne Dixon",
                 "Nate Southard",
                 "Tracie McBride",
                 "Robert S Wilson",
                 "John Mantooth",
                 "G.N. Braun",
                 "John F D Taff",
                 "Benjamin Kane Ethridge",
                 "Stephen Bacon",
                 "Steven W Booth",
                 "Brad C. Hodson",
                 "Jonathon Templar",
                 "Mark Scioneaux",
                 "R.J. Cavender",
                 "Norman L. Rubenstein",
                 "Danica Green",
                 "G.R. Yeates",
                 "Boyd E. Harris",
                 "Rena Mason"
              ]
           }
        }

(This is simplified data, but I've included the relevant parts.)

The issue arises when I run a multi_match query on contributors.name. Take
the query "john green" as an example: documents like the second one always
score higher, even though "john" and "green" aren't at all close to each
other in that document.
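
For concreteness, a request of the kind described above would look roughly like this. The post does not include the actual query, so the exact body is an assumption: a plain multi_match with no type set, searching only contributors.name.

```json
{
  "query": {
    "multi_match": {
      "query": "john green",
      "fields": ["contributors.name"]
    }
  }
}
```

Adding "type": "phrase" inside the multi_match object gives the phrase variant discussed next.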

Running the multi_match query with type=phrase helps, but a document with
John Green as one of several authors still always comes out on top, beating
every document with John Green as the only author, even though I have
custom scoring in place that should make another particular document score
higher. I also don't want to rely on phrase matching, because I want to be
able to run a cross_fields query across title and contributors.name, so
that queries like "the fault in our stars john green" work.
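
As a sketch of the cross_fields query meant here (again an assumption, since the real request is not shown in the post), the two field names are taken from the sample documents and everything else is left at its defaults:

```json
{
  "query": {
    "multi_match": {
      "query": "the fault in our stars john green",
      "type": "cross_fields",
      "fields": ["title", "contributors.name"]
    }
  }
}
```

The cross_fields type treats the listed fields as one combined field, which is why the title terms and the author terms can each match in a different field.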

So why are arrays with many values preferred over the single value in this
case? And how can I index or query this data to avoid this problem?

This is on ES 1.1.1.

Thanks,

Shane

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/04ac28f7-ee0f-4b8c-996b-f048fb4dd576%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(shane) #2

I should also note that the custom scoring is not the problem: the query
still prefers documents with many values for contributors.name even when
the custom scoring is removed.

On Wednesday, June 25, 2014 7:29:26 PM UTC-7, shane wrote:



(system) #3