Human names searching - how to improve results

Hello colleagues,
I would like some help.

I am trying to develop a search query for human names (two or three parts each) across roughly 100k records.

I am using the 14-day trial of Elastic Cloud for my research.

Using NEST (7.6.2), I created an index and loaded my data (people with their names).

After a few days of reading the documentation and testing, the combination below gives my best results so far, but I need them to be more precise.

I get clearly wrong results with a high match score, and I cannot filter them out automatically with a score threshold, because for other searches the same threshold is correct.

var resSearch = 
	client.Search<People>(s => s
		.Index(indexName)
		//.From(0)
		.Size(100)
		//.MinScore(minScore)

		.Query(q => q
			.Match(mf => mf
				.Name("SearchQuery")

				.Field(f => f.Name.FullName)
				.Query(searchPhrase)

				.Analyzer("standard")

				.Operator(Operator.Or)

				.AutoGenerateSynonymsPhraseQuery(true)

				.Fuzziness(Fuzziness.Ratio(3))
				.MaxExpansions(100)

				.FuzzyTranspositions(true)
				.MinimumShouldMatch("2<75%")
			)
		)
	);

Thanks in advance!
Cheers!

Welcome!

Could you provide a full recreation script, as described in "About the Elasticsearch category"? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

Thank you @dadoonet for the remarks.
Below are one indexing request and one search request.

My main goal: when I search with three names/words, records containing those three names, or similar two-part names, should come back with high scores.
With this combination of settings, people with only two-part names score too low.
I assume the algorithm also takes the string length into account, but that is not the effect I want.

My other question, since I am not sure: with the Standard cloud plan, is it possible to install and use official plugins (not custom ones) and analyzers?

Valid NEST response built from a successful (200) low level call on 
POST: /watchlistentries/_bulk

# Request:
{
	"index": {
		"_id": "637261905404690685"
	}
} {
	"id": 637261905404690685,
	"name": {
		"firstName": "Donald",
		"lastName": "Trump",
		"createDate": "2020-05-27T18:35:40.4692597+03:00"
	},
	"watchListId": 0,
	"createDate": "2020-05-27T18:35:40.4765106+03:00"
} {
	"index": {
		"_id": "637261905404765615"
	}
} {
	"id": 637261905404765615,
	"name": {
		"firstName": "Vladimir",
		"middleName": "Vladimirovich",
		"lastName": "Putin",
		"createDate": "2020-05-27T18:35:40.4770289+03:00"
	},
	"watchListId": 0,
	"createDate": "2020-05-27T18:35:40.4770359+03:00"
}

# Response: {
	"took": 7,
	"errors": false,
	"items": [{
		"index": {
			"_index": "watchlistentries",
			"_type": "_doc",
			"_id": "637261905404690685",
			"_version": 1,
			"result": "created",
			"_shards": {
				"total": 2,
				"successful": 2,
				"failed": 0
			},
			"_seq_no": 730518,
			"_primary_term": 1,
			"status": 201
		}
	}, {
		"index": {
			"_index": "watchlistentries",
			"_type": "_doc",
			"_id": "637261905404765615",
			"_version": 1,
			"result": "created",
			"_shards": {
				"total": 2,
				"successful": 2,
				"failed": 0
			},
			"_seq_no": 730519,
			"_primary_term": 1,
			"status": 201
		}
	}]
}
Valid NEST response built from a successful (200) low level call on 
POST: /watchlistentries/_search?typed_keys=true

# Request:
{
	"query": {
		"match": {
			"name.fullName": {
				"analyzer": "standard",
				"auto_generate_synonyms_phrase_query": true,
				"fuzziness": 3.0,
				"fuzzy_transpositions": true,
				"max_expansions": 100,
				"minimum_should_match": "2<75%",
				"operator": "or",
				"query": "Donald John Trump",
				"_name": "SearchQuery"
			}
		}
	},
	"size": 100
}

# Response: 
{
	"took": 731,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1,
			"relation": "eq"
		},
		"max_score": 8.728588,
		"hits": [{
			"_index": "watchlistentries",
			"_type": "_doc",
			"_id": "430905",
			"_score": 8.728588,
			"_source": {
				"type": 1,
				"name": {
					"fullName": "Jean Ronald OSCAR",
					"firstName": "Jean",
					"middleName": "Ronald",
					"lastName": "OSCAR",
					"createDate": "2020-05-15T07:45:04.0221748Z"
				},
				"createDate": "2020-05-15T07:45:04.0221745Z"
			},
			"matched_queries": ["SearchQuery"]
		}]
	}
}
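A possible explanation for this hit, sketched in Python under some assumptions. Fuzzy matching works on edit distance, and as far as I know Lucene supports at most 2 edits per term, so a fuzziness of 3.0 should not be more permissive than 2. The sketch uses plain Levenshtein distance (with fuzzy_transpositions enabled Elasticsearch also counts a swap of two adjacent characters as one edit, which is ignored here) and compares the lowercased query terms with the terms of the returned name. The helper names edit_distance and closest are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def closest(term: str, candidates: list[str]) -> tuple[int, str]:
    """Return (distance, candidate) for the nearest candidate term."""
    return min((edit_distance(term, c), c) for c in candidates)


# Terms as the standard analyzer would roughly produce them (lowercased words).
query_terms = ["donald", "john", "trump"]
hit_terms = ["jean", "ronald", "oscar"]

matches = {term: closest(term, hit_terms) for term in query_terms}
for term, (distance, candidate) in matches.items():
    print(f"{term!r} is {distance} edit(s) away from {candidate!r}")
```

Under these assumptions "donald" is 1 edit from "ronald" and "john" is 2 edits from "jean", so two of the three query terms match fuzzily, which is enough to satisfy "2<75%" for a three-term query. A stricter fuzziness (for example AUTO) or a non-zero prefix_length would probably exclude this record.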

A full reproduction script is something anyone can copy and paste into the Kibana Dev Console and run with one click to reproduce your use case.

Could you make sure anyone can copy and paste your example and run it from Kibana?
I doubt the bulk API will work with indented JSON.

My mistake. The code below has now been tested in the Kibana Console.
Thanks @dadoonet

POST /watchlistentries/_bulk
{ "index" : { } }
{ "name": { "fullName": "Donald Trump", "firstName": "Donald", "lastName": "Trump", "createDate": "2020-05-27T18:35:40.4692597+03:00" }, "createDate": "2020-05-27T18:35:40.4692597+03:00", "description" : "Some other data not needed for the test." }
{ "index" : { } }
{ "name": { "fullName": "Vladimir Vladimirovich Putin", "firstName": "Vladimir", "middleName": "Vladimirovich", "lastName": "Putin", "createDate": "2020-05-27T18:35:40.4770359+03:00" }, "createDate": "2020-05-27T18:35:40.4770359+03:00", "description" : "Some other data not needed for the test." }
POST /watchlistentries/_search?typed_keys=true
{ "query": { "match": { "name.fullName": { "analyzer": "standard", "auto_generate_synonyms_phrase_query": true, "fuzziness": 3.0, "fuzzy_transpositions": true, "max_expansions": 100, "minimum_should_match": "2<75%", "operator": "or", "query": "Donald John Trump", "_name": "SearchQuery" } } }, "size": 100 }

This is returning:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.5098253,
    "hits" : [
      {
        "_index" : "watchlistentries",
        "_type" : "_doc",
        "_id" : "pRQ1dXIBi3tAHrFjRbVY",
        "_score" : 1.5098253,
        "_source" : {
          "name" : {
            "fullName" : "Donald Trump",
            "firstName" : "Donald",
            "lastName" : "Trump",
            "createDate" : "2020-05-27T18:35:40.4692597+03:00"
          },
          "createDate" : "2020-05-27T18:35:40.4692597+03:00",
          "description" : "Some other data not needed for the test."
        },
        "matched_queries" : [
          "SearchQuery"
        ]
      }
    ]
  }
}

Is that wrong?
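
For reference, here is a rough Python sketch of how I read the minimum_should_match value "2<75%": with 2 or fewer optional clauses all of them are required, and above that 75% of them are required, rounded down. It only handles a single conditional spec with a positive number or percentage, and required_matches is a name made up here, not an Elasticsearch API.

```python
def required_matches(spec: str, optional_clauses: int) -> int:
    """Evaluate one conditional spec such as "2<75%" (positive values only)."""
    threshold, _, amount = spec.partition("<")
    if optional_clauses <= int(threshold):
        # At or below the threshold, every clause is required.
        return optional_clauses
    if amount.endswith("%"):
        # Percentages are rounded down to a whole number of clauses.
        return optional_clauses * int(amount[:-1]) // 100
    return int(amount)


for terms in (1, 2, 3, 4):
    print(terms, "term(s) ->", required_matches("2<75%", terms), "must match")
```

For the three-term query "Donald John Trump" this gives 2, so a document that matches only "Donald" and "Trump" is an expected hit.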

Hi @dadoonet,

We are trying to optimize our search over human names, but we cannot figure out exactly which settings we should apply to accomplish that.

Let me give you another example to make it clearer. Here is the indexing request:

POST /watchlistentries/_bulk
{ "index" : { } }
{ "name": { "fullName": "Sylvester Stallone" } }
{ "index" : { } }
{ "name": { "fullName": "Sylvester Enzio Stallone" } }
{ "index" : { } }
{ "name": { "fullName": "Stallone Sylvester" } }
{ "index" : { } }
{ "name": { "fullName": "Sylvester Stallone Enzio" } }
{ "index" : { } }
{ "name": { "fullName": "Stallone Sylvester Enzio" } }
{ "index" : { } }
{ "name": { "fullName": "Sylvester Enzio" } }
{ "index" : { } }
{ "name": { "fullName": "Sylvester John Stallone" } }
{ "index" : { } }
{ "name": { "fullName": "Sylvester Stallone John" } }
{ "index" : { } }
{ "name": { "fullName": "John Sylvester Stallone" } }
{ "index" : { } }
{ "name": { "fullName": "Stallone Sylvester John" } }

And then the search query:

POST /watchlistentries/_search?typed_keys=true
{ "query": { "match": { "name.fullName": { "analyzer": "standard", "auto_generate_synonyms_phrase_query": true, "fuzziness": 3.0, "fuzzy_transpositions": true, "max_expansions": 100, "minimum_should_match": "2<75%", "operator": "or", "query": "Sylvester Stallone", "_name": "SearchQuery" } } }, "size": 100 }

We made a basic .NET console application to list the results; they are the same as the results you will get in Kibana.

The "_score" is the same regardless of where the matching phrase appears in the name. We thought the score shows how relevant the match is.

For example, if we search for "Sylvester Stallone" we expect "Sylvester Stallone Enzio" to have a better score than "Stallone Sylvester John". We then want to be able to filter the results to those with a score greater than some number.

So here are our questions:

  1. Is there a way we could achieve that without using any additional paid plugins for human name search (by changing the search settings, for example)?

  2. If the answer to 1) is "No", could you recommend a cheaper plugin we could use?

  3. If the answer to 1) is "Yes", could you recommend some settings or special queries that work well with human names?

  4. The trial version does not permit using plugins. How could we test whether any of them works for us?

Thanks!

Best Regards!

You can combine multiple queries within a bool query to do better matching.
I wrote a full script here.

Maybe that could help you?
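
The linked script is not reproduced in this thread, so the following is only a minimal sketch of the general idea, not that script. It builds a search body in Python where a fuzzy match is required and two optional clauses reward stronger matches: an exact-order phrase and all terms present without fuzziness. The function name name_query and the boost values are assumptions chosen for illustration and would need tuning against real data.

```python
import json


def name_query(phrase: str, field: str = "name.fullName") -> dict:
    """Build a bool query: fuzzy recall in `must`, precision boosts in `should`."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Broad, typo-tolerant match that decides which documents qualify.
                    {"match": {field: {"query": phrase,
                                       "fuzziness": "AUTO",
                                       "minimum_should_match": "2<75%"}}}
                ],
                "should": [
                    # Extra score when the terms appear adjacent and in the same order.
                    {"match_phrase": {field: {"query": phrase, "boost": 3.0}}},
                    # Extra score when every term matches exactly, in any order.
                    {"match": {field: {"query": phrase,
                                       "operator": "and",
                                       "boost": 2.0}}},
                ],
            }
        },
        "size": 100,
    }


body = name_query("Sylvester Stallone")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after POST /watchlistentries/_search in the Kibana Console. With a query like this, "Sylvester Stallone Enzio" should score above "Stallone Sylvester John", because only the former contains the exact phrase "Sylvester Stallone".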
