I'm doing fuzzy match address lookup on a field using a shingle filter.
It appears that scoring designed for natural-language text (TF-IDF, BM25, etc.) isn't appropriate here.
E.g. a fuzzy match on a rare term (SMYTH STREET) shouldn't score higher than an exact match on a common term (SMITH STREET). I think I just need a constant score for each term/shingle, with these scores weighted by edit distance and then summed (which I think is the default way that term scores are combined).
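To make the scoring idea above concrete, here is a minimal standalone sketch in Python. It is not how Lucene or Elasticsearch computes scores; the function names, the unigram-plus-bigram shingling, and the linear edit-distance weighting are all assumptions chosen for illustration. Each query shingle contributes a constant 1.0 for an exact match, less for a fuzzy match, and nothing beyond `max_edits`, with no dependence on term rarity.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def shingles(text: str, size: int = 2) -> List[str]:
    """Upper-cased unigrams plus word shingles of the given size (hypothetical analysis)."""
    tokens = text.upper().split()
    grams = [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return tokens + grams


def score(query: str, candidate: str, max_edits: int = 2) -> float:
    """Sum a constant per-shingle score, scaled down by edit distance (assumed weighting)."""
    cand_terms = shingles(candidate)
    if not cand_terms:
        return 0.0
    total = 0.0
    for q in shingles(query):
        best = min(edit_distance(q, c) for c in cand_terms)
        if best <= max_edits:
            # Exact match contributes 1.0; each edit removes an equal fraction.
            total += 1.0 - best / (max_edits + 1)
    return total


exact = score("SMITH STREET", "SMITH STREET")   # 3.0: three shingles, all exact
fuzzy = score("SMITH STREET", "SMYTH STREET")   # about 2.33: two shingles are one edit away
print(exact, fuzzy)
```

With this weighting the exact match on the common name always outranks the fuzzy match on the rare one, which is the behaviour described above.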
"constant_score" and "script_score" aren't right because they set the score for the whole doc and I just want to override the score for each term. Is there an easier way than adding my own Lucene scorer?
Then maybe you can use constant_score on each individual clause rather than on the whole query?
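As a rough sketch of what that suggestion could look like, the Python below builds a request body in which every term gets its own constant_score clause inside a bool query. The field name `address`, the helper name, and the list of terms are hypothetical; the terms would have to be produced by whoever builds the query. Note that a constant_score clause contributes its boost whenever it matches, so this sketch does not weight by edit distance.

```python
import json
from typing import Dict, List


def per_clause_constant_score(terms: List[str], field: str = "address",
                              fuzziness: int = 1) -> Dict:
    """Wrap one fuzzy clause per term in constant_score so each match adds a fixed boost."""
    clauses = [
        {
            "constant_score": {
                "filter": {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}},
                "boost": 1.0,
            }
        }
        for term in terms
    ]
    # bool/should sums the scores of the clauses that match.
    return {"query": {"bool": {"should": clauses}}}


body = per_clause_constant_score(["london", "circuit", "london circuit"])
print(json.dumps(body, indent=2))
```

Because each clause scores a constant, term rarity no longer affects ranking; documents matching more clauses simply accumulate a higher total.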
Thanks for the idea, but wouldn't the client then have to apply the shingle filter to get all the terms (or make a separate ES request to do it) in order to build a rather big query? That rather breaks the idea of analysis being a server-side concern that the client doesn't need to worry about. I'd much rather find a way to tweak the scoring on the server to make it appropriate to the task (I'm sure I'm not the first or last user to search data that is not really natural language). Cheers, Neil
Can you share your current query?
curl 'localhost:9200/gnaf/_search?pretty' -d '
"query": "7 LONDON CIRCUIT CITY ACT 2601",
I'm trying to get the best (or best equal) hit as the top result, even with bad input such as an underspecified address (e.g. missing street number), typing errors, postcode before state, wrong locality name etc. so a bulk lookup can just take the top hit.
The index contains locality and street aliases where more than one name is acceptable, also flat/unit and level/floor and geocodes.
and thanks for your time!
Unfortunately there's so much more to this particular ranking problem than tweaking settings on a generic fuzzy string matching algorithm. There's a lot of ambiguity in addresses and some of the logic that a human applies to parse addresses might be as follows:
Given a search for "7 Oxford St" I know we don't mean the town of Oxford.
Given a search for "1 High St, Richmond, Yorkshire" I know we definitely don't mean the other town of Richmond 200 miles away in London.
Given a search for "57 Thurlton st" and matches on "57 Thurlton" and "57 high st" I know that "thurlton" is a higher-value word.
Given a search for "Thirlton avenue" I know "avenue" is always spelled correctly in my reference data and we do not need to go fuzzy on that word.
With a specialized search domain like this it can make sense to provide a layer of software that attempts to parse and understand the user query better before then rewriting it into a request to Elasticsearch.
Another strategy is to have a big bool query with a should array filled with different forms of running the same user input, ranging from the sloppy "any word plus fuzzy" to the very strict, e.g. exact phrase matches. Docs which satisfy more of the given clauses will naturally rank higher.
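A minimal sketch of that layered approach, written in Python as a request-body builder. The field name `address`, the helper name, and the particular three layers are assumptions for illustration; a real query would likely need more layers and tuned boosts.

```python
import json
from typing import Dict


def layered_address_query(text: str, field: str = "address") -> Dict:
    """Run the same input at several strictness levels inside one bool/should."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Sloppy: any word may match, with fuzzy matching allowed.
                    {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
                    # Stricter: every word must be present, no fuzziness.
                    {"match": {field: {"query": text, "operator": "and"}}},
                    # Strictest: the words must appear as an exact phrase.
                    {"match_phrase": {field: {"query": text}}},
                ]
            }
        }
    }


body = layered_address_query("7 LONDON CIRCUIT CITY ACT 2601")
print(json.dumps(body, indent=2))
```

A document that satisfies the phrase clause also satisfies the two looser ones, so it collects all three scores and rises above documents that only match fuzzily.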
Thanks for the ideas. A lot can go wrong when you attempt to segment an address of unknown format (e.g. building names, unit/flats, levels/floors, number ranges, prefixes, suffixes, abbreviations, aliases for streets and localities, postcode placement). In general we don't know the error characteristics of the input (which fields are most likely to contain errors). I've found that using bigrams/shingles and fuzzy matching actually handles these issues pretty well (I evaluate each change against many variations of addresses).
I'm currently testing changes to ClassicSimilarity.tf() and idf() (by subclassing) to address the original "not natural language" issue. I'm doing this in raw Lucene for now. Can anyone advise on how best to add a new custom Similarity class to ES? Could it be done with a Groovy script? Thanks for all your help.