Advanced scoring of multi-field searching (only count a token once)


(mtthrok) #1

Hi,

I'm trying to create a search engine over users that searches over people's
names and other metadata like where they work (so a user could query "tim
cook apple"). I have a mapping schema (pasted below) with first_name,
last_name, and affils fields (affils will usually be a list of affiliations
like "apple"). Each field is indexed with the full tokens as well as an
additional "partial" sub-field that holds edge ngrams.

To query against it, I was originally using a bool/should that hit each of
the fields (like in
http://elasticsearch-users.115913.n3.nabble.com/help-needed-with-the-query-td3177477.html#a3178856).
The issue is that if someone was named "Tim Cook" and worked at "Tim Cook
Design", they would come up at the top, even for "Tim Cook Apple". In our
case, we really only want each token to count toward the score once, so that
Tim Cook at Tim Cook Design scores the same as Tim Cook at Apple for the
query "Tim Cook". I switched to a query_string query (pasted below), which
does better but still gives more weight to those cases (now "Tim Cook Apple"
works, but "Tim Cook" still gives more weight to the one at Tim Cook
Design).

So, how can I customize the scoring of query_string (or another multi-field
query) so that a token contributes to the score in only one field?
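
To make the difference concrete, here is a toy sketch in Python. It is not
Lucene's scoring: the weights are made up, and the helper names
(matched_fields, score_sum, score_once) are hypothetical. It only contrasts
"every matching field adds to the score" with "each token counts once, via
its best field".

```python
# Toy model only: the weights are invented and are not Lucene scores.
def matched_fields(doc, token):
    """Return the names of the fields whose text contains the token."""
    token = token.lower()
    hits = []
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        words = {w for v in values for w in v.lower().split()}
        if token in words:
            hits.append(field)
    return hits


def score_sum(doc, query, weights):
    """bool/should-like: every field that matches a token adds to the score."""
    return sum(weights[f] for t in query.split() for f in matched_fields(doc, t))


def score_once(doc, query, weights):
    """Desired behaviour: each token counts once, through its best field."""
    return sum(
        max((weights[f] for f in matched_fields(doc, t)), default=0.0)
        for t in query.split()
    )


WEIGHTS = {"first_name": 1.0, "last_name": 1.0, "affils": 1.0}
apple = {"first_name": "Tim", "last_name": "Cook", "affils": ["Apple"]}
design = {"first_name": "Tim", "last_name": "Cook",
          "affils": ["Tim Cook Design", "Random co"]}

for q in ("Tim Cook", "Tim Cook Apple"):
    print(q, "| sum:", score_sum(apple, q, WEIGHTS), score_sum(design, q, WEIGHTS),
          "| once:", score_once(apple, q, WEIGHTS), score_once(design, q, WEIGHTS))
```

In this toy model the summed score always favours the Tim Cook Design
document, while the count-once score ties the two for "Tim Cook" and prefers
the Apple document for "Tim Cook Apple".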

Thanks! I'm pretty new to elasticsearch/lucene, so sorry if this is obvious.

===========
Mapping setup:
curl -XPUT 'http://127.0.0.1:9200/people_search/?pretty=1' -d '
{
  "mappings" : {
    "person" : {
      "properties" : {
        "_id" : {
          "type" : "integer",
          "index" : "no"
        },
        "last_name" : {
          "fields" : {
            "partial" : {
              "search_analyzer" : "full_name",
              "index_analyzer" : "partial_name",
              "type" : "string"
            },
            "last_name" : {
              "type" : "string",
              "analyzer" : "full_name"
            }
          },
          "type" : "multi_field"
        },
        "first_name" : {
          "fields" : {
            "partial" : {
              "search_analyzer" : "partial_name_search",
              "index_analyzer" : "partial_name",
              "type" : "string"
            },
            "first_name" : {
              "type" : "string",
              "analyzer" : "full_name"
            }
          },
          "type" : "multi_field"
        },
        "affils" : {
          "fields" : {
            "partial" : {
              "search_analyzer" : "full_name",
              "index_analyzer" : "partial_name",
              "type" : "string"
            },
            "names" : {
              "type" : "string",
              "analyzer" : "full_name"
            }
          },
          "type" : "multi_field"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "filter" : {
        "name_ngrams" : {
          "side" : "front",
          "max_gram" : 10,
          "min_gram" : 2,
          "type" : "edgeNGram"
        },
        "name_ngrams_search" : {
          "side" : "front",
          "max_gram" : 10,
          "min_gram" : 2,
          "type" : "edgeNGram"
        }
      },
      "analyzer" : {
        "full_name" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        },
        "partial_name" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        },
        "partial_name_search" : {
          "filter" : [
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams_search"
          ],
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'

=============
Insert some data
curl -XPOST 'http://127.0.0.1:9200/_bulk?pretty=1' -d '
{"index" : {"_index" : "people_search", "_type" : "person", "_id" : 1}}
{"_id" : 1, "last_name" : "Cook", "first_name" : "Tim", "affils":["Apple"]}
{"index" : {"_index" : "people_search", "_type" : "person", "_id" : 2}}
{"_id" : 2, "last_name" : "Cook", "first_name" : "Tim", "affils":["Tim Cook Design", "Random co"]}
'

===============
Query:
curl -XPOST 'http://127.0.0.1:9200/people_search/person/_search?search_type=dfs_query_then_fetch' -d '
{"query": {"query_string": {"fields": ["first_name.partial",
"first_name.first_name^1.5",
"last_name.partial",
"last_name.last_name^1.5",
"affils.partial",
"affils.names^1.5"],
"query": "Tim Cook",
"use_dis_max": true}}}
'

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(mtthrok) #2

I guess I could create a new indexed field that uses an analyzer that
returns the unique tokens from the first_name, last_name, and affils fields
and just query against that. Is that the best way to do it?
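
A minimal sketch of that idea, done on the client side in Python before
indexing rather than inside an Elasticsearch analyzer. The field name
all_tokens and both helper functions are hypothetical, and analyze() is only
a rough stand-in for the full_name analyzer (lowercase, ASCII folding, split
on non-alphanumerics), not a reproduction of it.

```python
import re
import unicodedata


def analyze(text):
    """Rough stand-in for full_name: ASCII-fold, lowercase, split into tokens."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[a-z0-9]+", folded.lower())


def with_combined_field(person, target="all_tokens"):
    """Return a copy of the document with a de-duplicated token field added."""
    seen = []  # keeps first-seen order so the result is deterministic
    for field in ("first_name", "last_name", "affils"):
        value = person.get(field, [])
        values = value if isinstance(value, list) else [value]
        for text in values:
            for token in analyze(text):
                if token not in seen:
                    seen.append(token)
    enriched = dict(person)
    enriched[target] = " ".join(seen)
    return enriched


doc = {"first_name": "Tim", "last_name": "Cook",
       "affils": ["Tim Cook Design", "Random co"]}
print(with_combined_field(doc)["all_tokens"])
```

Because "tim" and "cook" appear only once in the combined field, a query
against that single field cannot reward the document for repeating the name
in affils. The trade-off is that per-field boosts are no longer available on
that field.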

On Monday, June 24, 2013 4:04:44 PM UTC-7, mtt...@gmail.com wrote:


(germap) #3

Hi,

Just for the record, did you finally manage to solve this problem?
The docs [1] say the default behavior ("best_fields") for multi_match
queries is to take only the highest-scoring field into account; you can
configure it to include the scores from the other fields, though.

I was facing a similar issue and could overcome it by boosting the most
relevant field with a "^2".
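
For illustration, a sketch in Python of what such a request body might look
like, assuming an Elasticsearch version that supports the multi_match "type"
parameter described in [1]. The helper name multi_match_body is hypothetical,
and the field names and the "^2" boost are only borrowed from this thread as
an example. Note that "best_fields" picks the best-scoring field for the
query as a whole, so it may not behave exactly like counting each token once.

```python
import json


def multi_match_body(text, boosted_fields, tie_breaker=0.0):
    """Build a multi_match request body; boosted_fields is a list of (name, boost)."""
    fields = [name if boost == 1 else "%s^%s" % (name, boost)
              for name, boost in boosted_fields]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "best_fields",       # only the best-scoring field counts...
                "tie_breaker": tie_breaker,  # ...plus this fraction of the other fields
                "fields": fields,
            }
        }
    }


body = multi_match_body(
    "Tim Cook",
    [("first_name.first_name", 2), ("last_name.last_name", 2), ("affils.names", 1)],
)
print(json.dumps(body, indent=2))
```

With tie_breaker left at 0.0 only the best field contributes; raising it
lets the other matching fields add a fraction of their scores.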

Regards,

Germán

[1] http://www.elasticsearch.org/blog/multi-field-search-just-got-better/

On Monday, June 24, 2013 18:37:08 UTC-5, mtt...@gmail.com wrote:

I guess I could create a new indexed field that uses an analyzer that
returns the unique tokens from the first_name, last_name, and affils fields
and just query against that. Is that the best way to do it?

To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/eaeb62d6-beea-4e53-a1f3-7fc3ce972145%40googlegroups.com.


(system) #4