I started experimenting with the zentity plugin yesterday in order to see if we can use it to solve some of our entity resolution problems (instead of building our own custom software to do the same).
I find that I am getting a pretty high error rate on an index with about 13 million entries, using a model with a single resolver that looks at name and phone number. Iterating through a test set of 1000 records, where each record has a name and a phone number, I get exceptions like this:
org.elasticsearch.ElasticsearchException$1: maxClauseCount is set to 1024
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639) ~[elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137) [elasticsearch-7.3.2.jar:7.3.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:264) [elasticsearch-7.3.2.jar:7.3.2]
What's worse, whenever one of these errors is thrown, the request takes around 10-30 seconds to complete, which makes it too slow for processing the full data set (around 70k records).
Just before the exception, part of the query is dumped to stderr, and it looks like a giant query containing all of the different phone numbers in the index.
Is there something I can do to prevent this from happening? Is this a result of something I have configured incorrectly?
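For reference, each test record is resolved with a request roughly like this (the entity type and attribute names here reflect my model; see the zentity docs for the exact request shape):

```json
POST _zentity/resolution/person
{
  "attributes": {
    "name": [ "Joe Smith" ],
    "phone": [ "555-0100" ]
  }
}
```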
Since the resolution logic is recursive, the query can grow to a large number of terms, but I suspect your data/config may not be realistic if a single person can amass more than 1024 names/phone numbers.
If you're resolving only one entity at a time (as opposed to batch-resolving all entities), you could look at the Graph feature in X-Pack to crawl the connections in your data. It has logic to overcome Lucene's 1024-clause limit.
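If the oversized queries are legitimate for your data, another option (a workaround rather than a fix for the underlying data) is to raise Lucene's clause limit via a static node setting:

```yaml
# elasticsearch.yml (static setting; requires a node restart)
# Default is 1024. Raising it trades protection against runaway
# queries for the ability to run larger bool queries.
indices.query.bool.max_clause_count: 4096
```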
Green dots represent crime doc IDs, pink dots represent a "compound key" field which is a combination of columns from the crime doc. This particular person has many pink dots because there are many aliases or data entry issues.
So an example crime doc is indexed like this:
{
  "crime_id": "099903-2016",
  "keys": [
    "KEYTYPE1: SMITH | JOE | 293 high street | 90210",
    "KEYTYPE2: SMITH | JOE | 03/04/1996",
    ...other keys ...
  ]
}
Form these compound keys using multiple key types (e.g. firstname+surname+zipcode and firstname+surname+DOB). If you use only one key type, you only match identical values, so multiple matching strategies (key types) are needed to kick-start the ID-chaining logic.
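A minimal sketch of building those compound keys at index time (the input field names `first`, `last`, `address`, `zip`, and `dob` are assumptions for illustration):

```python
def build_keys(record):
    """Build multiple compound keys ("keytypes") for one record.

    Each key type is a different matching strategy, so records that
    disagree on one field (e.g. address) can still link via another
    key type (e.g. date of birth).
    """
    first = record["first"].upper()
    last = record["last"].upper()
    return [
        # KEYTYPE1: name + address + zipcode
        f"KEYTYPE1: {last} | {first} | {record['address'].lower()} | {record['zip']}",
        # KEYTYPE2: name + date of birth
        f"KEYTYPE2: {last} | {first} | {record['dob']}",
    ]

# Example: produce the doc shown above from a raw record.
doc = {
    "crime_id": "099903-2016",
    "keys": build_keys({
        "first": "Joe",
        "last": "Smith",
        "address": "293 High Street",
        "zip": "90210",
        "dob": "03/04/1996",
    }),
}
```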
Index these keys as untokenized keyword fields for link-ability in Graph, and also as tokenized text for searchability.
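A minimal mapping sketch for that dual indexing (the index name and field names are assumptions matching the example doc above), using a multi-field so each key is stored both as an exact `keyword` and as analyzed `text`:

```json
PUT crimes
{
  "mappings": {
    "properties": {
      "crime_id": { "type": "keyword" },
      "keys": {
        "type": "keyword",
        "fields": {
          "text": { "type": "text" }
        }
      }
    }
  }
}
```

Graph would then walk the exact `keys` field, while full-text searches can use `keys.text`.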