I have a single-shard Elasticsearch index on which I'm running a [dis_max][4] query.
Given some user details
(first name, last name, date of birth, address, phone, username, email, etc.),
it looks up users in the index by combining a set of matching clauses.
E.g.

- match username ([fuzzy][1], boosted 2x)
- should match first and last name ([bool][3] combining [match-term][2] queries for FN and LN, boosted 1.1x)
- must match FN, LN and DOB ([bool][3] combining [fuzzy][1] for FN and LN and [match-term][2] for DOB, boosted 3x)
- match phone ([match-term][2], boosted 2x)
- etc.

[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
[2]: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
[3]: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
[4]: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html

See query below (with obscured input data):
https://gist.github.com/andreaschiffo/bf5ebeac1d6875a1a78dbb9e2eb8e19b
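
For readers who can't open the gist, here is a rough Python sketch of the general shape of such a query. It is not the actual query: the field names (`username`, `first_name`, `last_name`, `dob`, `phone`), the helper `build_user_query`, and the exact `should`/`must` layout are assumptions made for illustration, based only on the clause list above.

```python
def build_user_query(user, tie_breaker=0.5):
    """Build a dis_max request body from a dict of user details.

    Field names are hypothetical; adapt them to the real mapping.
    """
    clauses = [
        # 1. fuzzy match on username, boosted 2x
        {"fuzzy": {"username": {"value": user["username"], "boost": 2.0}}},
        # 2. should match first and last name (term queries), boosted 1.1x
        {"bool": {
            "should": [
                {"term": {"first_name": {"value": user["first_name"]}}},
                {"term": {"last_name": {"value": user["last_name"]}}},
            ],
            "boost": 1.1,
        }},
        # 3. must match FN and LN (fuzzy) and DOB (term), boosted 3x
        {"bool": {
            "must": [
                {"fuzzy": {"first_name": {"value": user["first_name"]}}},
                {"fuzzy": {"last_name": {"value": user["last_name"]}}},
                {"term": {"dob": {"value": user["dob"]}}},
            ],
            "boost": 3.0,
        }},
        # 4. exact match on phone, boosted 2x
        {"term": {"phone": {"value": user["phone"], "boost": 2.0}}},
    ]
    return {
        "explain": True,  # ask Elasticsearch to return score explanations
        "query": {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}},
    }
```

The returned dict can be serialized with `json.dumps` and sent as the body of a `_search` request.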
Every clause contributes to the score, and I've set `tie_breaker` to 0.5,
so that a result's score is the highest score among its matching clauses, plus 0.5
times the sum of the scores of the other matching clauses.
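
To make that arithmetic concrete, here is a minimal Python sketch of how I understand the dis_max combination. The function name `dis_max_score` and the use of `None` for a clause that did not match the document are my own conventions, not part of any Elasticsearch API.

```python
def dis_max_score(clause_scores, tie_breaker=0.5):
    """Combine per-clause scores the way dis_max is documented to.

    clause_scores: one entry per clause; None means the clause
    did not match the document and so contributes nothing.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0  # no clause matched, so the document is not a hit
    best = max(matched)
    # best score, plus tie_breaker times the sum of all other matching scores
    return best + tie_breaker * (sum(matched) - best)
```

For example, `dis_max_score([3.0, 1.0, 2.0])` gives 3.0 + 0.5 * (1.0 + 2.0) = 4.5, while `dis_max_score([3.0, None, None])` gives just 3.0, since only one clause matched.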
Running the query with a few input combinations:
- in some cases I get good scores that make for good matching,
- in other cases, where I'd expect the same or a similarly high score, I get a very low score because some of the most relevant matching clauses seem to be skipped.
I have debugged the query execution with `"explain": true`,
and in the explanation:
- the first result gets a high score, with all query clauses contributing,
- the second one (which, judging from the data, should score high enough) gets a low score, and some clauses don't appear in the explanation at all, as if they had been excluded/ignored.
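
To compare the two explanations side by side, a small helper like the following can print the top levels of each explanation tree. It is only a sketch: it assumes each hit's `_explanation` object is a nested dict with `value`, `description` and `details` keys, which is the shape I see in the search response, and the name `outline` is made up.

```python
def outline(explanation, max_depth=2, depth=0):
    """Return an indented list of lines summarizing an explanation tree.

    explanation: a dict with "value", "description" and an optional
    "details" list of child explanations (same shape, nested).
    """
    line = "{}{:.4f}  {}".format(
        "  " * depth, explanation["value"], explanation["description"]
    )
    lines = [line]
    if depth < max_depth:
        for child in explanation.get("details", []):
            lines.extend(outline(child, max_depth, depth + 1))
    return lines
```

Calling `print("\n".join(outline(hit["_explanation"])))` for each hit shows at a glance which top-level clauses are present for one document and absent for the other.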
I'd like to understand why these clauses would be ignored/skipped in some cases.
Does anybody know whether this could be an issue in the way ES translates queries into Lucene?
See the example results below (all data is obscured, but the two results have very similar values in the individual fields).
https://gist.github.com/andreaschiffo/e8c3d6b2f86c53ba6a28257d47a1831b