I started experimenting with the zentity plugin yesterday in order to see if we can use it to solve some of our entity resolution problems (instead of building our own custom software to do the same).
I find that I am getting a pretty high error rate on an index with about 13 million entries, using a model with a single resolver that looks at name and phone number. Iterating through a test set of 1000 records, where each record has a name and a phone number, I get exceptions like this:
org.elasticsearch.ElasticsearchException$1: maxClauseCount is set to 1024
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137) [elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:264) [elasticsearch-7.3.2.jar:7.3.2]
What's worse, whenever one of these errors is thrown, the request takes around 10-30 seconds to complete, which makes it too slow for processing the full data set (around 70k records).
Just before the exception, part of the query is dumped to stderr, and it looks like a giant query containing all of the different phone numbers in the index.
Is there something I can do to prevent this from happening? Is this a result of something I have configured incorrectly?
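For reference, each test record is resolved with a request roughly like this (the entity type and attribute names here reflect my model; see the zentity docs for the exact request shape):

```json
POST _zentity/resolution/person
{
  "attributes": {
    "name": [ "Joe Smith" ],
    "phone": [ "555-0100" ]
  }
}
```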
Since the resolution logic is recursive, the query can grow to a large number of terms, but I suspect your data/config may not be realistic if a single person can amass more than 1024 names/phone numbers.
If you're resolving only one entity at a time (as opposed to batch-resolving all entities), you could look at the Graph feature in X-Pack to crawl the connections in your data. It has logic to overcome Lucene's 1024-clause limit.
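If the oversized queries are legitimate for your data, another option (a workaround rather than a fix for the underlying data) is to raise Lucene's clause limit via a static node setting:

```yaml
# elasticsearch.yml (static setting; requires a node restart)
# Default is 1024. Raising it trades protection against runaway
# queries for the ability to run larger bool queries.
indices.query.bool.max_clause_count: 4096
```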
Green dots represent crime doc IDs, pink dots represent a "compound key" field which is a combination of columns from the crime doc. This particular person has many pink dots because there are many aliases or data entry issues.
So an example crime doc is indexed like this:
{
  "crime_id": "099903-2016",
  "keys": [
    "KEYTYPE1: SMITH | JOE | 293 high street | 90210",
    "KEYTYPE2: SMITH | JOE | 03/04/1996",
    ...other keys ...
  ]
}
Form these compound keys using multiple key types (e.g. firstname+surname+zipcode and firstname+surname+DOB). If you use only one key type, you only match identical values, so multiple matching strategies (key types) are needed to kick-start the ID-chaining logic.
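A minimal sketch of building those compound keys at index time (the input field names `first`, `last`, `address`, `zip`, and `dob` are assumptions for illustration):

```python
def build_keys(record):
    """Build multiple compound keys ("keytypes") for one record.

    Each key type is a different matching strategy, so records that
    disagree on one field (e.g. address) can still link via another
    key type (e.g. date of birth).
    """
    first = record["first"].upper()
    last = record["last"].upper()
    return [
        # KEYTYPE1: name + address + zipcode
        f"KEYTYPE1: {last} | {first} | {record['address'].lower()} | {record['zip']}",
        # KEYTYPE2: name + date of birth
        f"KEYTYPE2: {last} | {first} | {record['dob']}",
    ]

# Example: produce the doc shown above from a raw record.
doc = {
    "crime_id": "099903-2016",
    "keys": build_keys({
        "first": "Joe",
        "last": "Smith",
        "address": "293 High Street",
        "zip": "90210",
        "dob": "03/04/1996",
    }),
}
```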
Index these keys as untokenized keyword fields for link-ability in Graph, and also as tokenized text for searchability.
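A minimal mapping sketch for that dual indexing (the index name and field names are assumptions matching the example doc above), using a multi-field so each key is stored both as an exact `keyword` and as analyzed `text`:

```json
PUT crimes
{
  "mappings": {
    "properties": {
      "crime_id": { "type": "keyword" },
      "keys": {
        "type": "keyword",
        "fields": {
          "text": { "type": "text" }
        }
      }
    }
  }
}
```

Graph would then walk the exact `keys` field, while full-text searches can use `keys.text`.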