Scoring docs that are not natural language

neil.bacon · August 19, 2016, 1:48am

Hi.
I'm doing fuzzy match address lookup on a field using a shingle filter.
It appears that scoring designed for natural-language text (TF-IDF, BM25 etc.) isn't appropriate.
E.g. a fuzzy match on a rare term (SMYTH STREET) shouldn't score more than an exact match on a common term (SMITH STREET). I think I just need a constant score for each term/shingle and then these to be weighted by edit distance and summed (which I think is the default way that term scores are combined).
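
To make the scoring described above concrete, here is a minimal sketch in plain Python (not Lucene or Elasticsearch code). The linear weight `1 - distance / length` and the helper names `edit_distance`, `term_weight` and `score` are assumptions for illustration only; the post does not say how edit distance should translate into a weight.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def term_weight(query_term: str, doc_term: str, max_edits: int = 2) -> float:
    """Constant score of 1.0 per term, scaled down by edit distance.

    Hypothetical weighting: the linear falloff is an assumption, not a
    formula taken from Lucene or from the thread.
    """
    d = edit_distance(query_term, doc_term)
    if d > max_edits:
        return 0.0
    return 1.0 - d / max(len(query_term), len(doc_term))


def score(query_terms, doc_terms) -> float:
    """Sum, over the query terms, of the best weight found in the document."""
    return sum(
        max((term_weight(q, t) for t in doc_terms), default=0.0)
        for q in query_terms
    )


exact = score(["SMITH", "STREET"], ["SMITH", "STREET"])
fuzzy = score(["SMITH", "STREET"], ["SMYTH", "STREET"])
print(exact > fuzzy)  # term rarity plays no part in either score
```

Under this scheme the exact SMITH STREET match always beats the fuzzy SMYTH STREET match, however rare SMYTH is in the index.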

"constant_score" and "script_score" aren't right because they set the score for the whole doc and I just want to override the score for each term. Is there an easier way than adding my own Lucene scorer?

jpountz · August 19, 2016, 8:59am

Then maybe you can use constant_score on each individual clause rather than on the whole query?
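
One way to read this suggestion, as a rough sketch: wrap each term in its own `constant_score` clause and combine the clauses in a `bool` query, so that every matching term contributes the same fixed amount. The helper name `per_term_constant_score` and the boost of 1.0 are assumptions for illustration; the field name is taken from the query later in this thread.

```python
import json


def per_term_constant_score(field, terms, fuzziness=2, prefix_length=2):
    """Build a bool query with one constant_score clause per term.

    Each clause that matches adds its boost to the document score,
    regardless of how rare the term is in the index.
    """
    should = []
    for term in terms:
        should.append({
            "constant_score": {
                "filter": {
                    "match": {
                        field: {
                            "query": term,
                            "fuzziness": fuzziness,
                            "prefix_length": prefix_length,
                        }
                    }
                },
                "boost": 1.0,  # assumed constant weight per term
            }
        })
    return {"query": {"bool": {"should": should}}}


body = per_term_constant_score("d61Address", ["7", "LONDON", "CIRCUIT"])
print(json.dumps(body, indent=2))
```

Note that inside `constant_score` a fuzzy match and an exact match on the same term score the same, so this alone does not weight by edit distance. A second, non-fuzzy clause per term would be one way to favour exact matches.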

neil.bacon · August 21, 2016, 11:53pm

Thanks for the idea, but wouldn't the client then have to apply the shingle filter to get all the terms (or make a separate ES request to do it) in order to build a rather big query? That kind of breaks the idea of analysis being a server-side thing that the client doesn't need to worry about. I'd much rather find a way to tweak the scoring in the server, to make it appropriate to the task (I'm sure I'm not the first or last user to search data that is not really natural language). Cheers, Neil
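
To illustrate the concern, here is a rough sketch of what reproducing the shingle step on the client would involve. The actual analyzer for the field is not shown in this thread, so whitespace tokenization, two-word shingles and kept unigrams are all assumptions, and `shingles` is a hypothetical helper.

```python
def shingles(text, min_size=2, max_size=2, output_unigrams=True):
    """Emit word shingles (word n-grams) from whitespace-separated tokens.

    Assumed settings: shingle sizes 2 to 2, unigrams kept, tokens joined
    by a single space.
    """
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


terms = shingles("7 LONDON CIRCUIT CITY ACT 2601")
print(len(terms))  # one query clause would be needed per term
print(terms)
```

Even this short address yields eleven terms under these assumptions, and the client logic would have to be kept in step with the server-side analyzer.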

jpountz · August 22, 2016, 1:33pm

Can you share your current query?

neil.bacon · August 22, 2016, 11:12pm

Sure:

curl -H 'Content-Type: application/json' 'localhost:9200/gnaf/_search?pretty' -d '
{
    "query": {
        "match": {
            "d61Address": {
                "query": "7 LONDON CIRCUIT CITY ACT 2601",
                "fuzziness": 2,
                "prefix_length": 2
            }
        }
    },
    "size": 5
}'

I'm trying to get the best (or equal-best) hit as the top result, even with bad input such as an underspecified address (e.g. missing street number), typing errors, postcode before state, or a wrong locality name, so that a bulk lookup can just take the top hit.
The index contains locality and street aliases where more than one name is acceptable, as well as flat/unit, level/floor and geocodes.

And thanks for your time!

Mark_Harwood · August 23, 2016, 10:07am

Unfortunately there's so much more to this particular ranking problem than tweaking settings on a generic fuzzy string matching algorithm. There's a lot of ambiguity in addresses and some of the logic that a human applies to parse addresses might be as follows:

Given "7 Oxford St", I know we don't mean the town of Oxford.
Given "1 High St, Richmond, Yorkshire", I know we definitely don't mean the other town of Richmond 200 miles away in London.
Given a search for "57 Thurlton st" and matches on "57 Thurlton" and "57 high st", I know that "thurlton" is a higher-value word.
Given a search for "Thirlton avenue", I know "avenue" is always spelled correctly in my reference data and we do not need to go fuzzy on that word.

With a specialized search domain like this it can make sense to provide a layer of software that attempts to parse and understand the user query better before rewriting it into a request to Elasticsearch.

Another strategy is to have a big bool query with a should array filled with different forms of running the same user input, ranging from the sloppy "any word plus fuzzy" to the very strict e.g. exact phrase matches. Docs which satisfy more of the given clauses will naturally rank higher.
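
As a rough sketch of that second strategy, the same user input can be run through clauses of increasing strictness inside one `bool` query. The particular mix of clauses and the helper name `layered_address_query` are assumptions for illustration; the field name and fuzziness settings are borrowed from the query earlier in this thread.

```python
import json


def layered_address_query(field, text, size=5):
    """Run the same input through clauses ranging from sloppy to strict.

    A document satisfying more of the should clauses accumulates a
    higher score.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # sloppy: any word, fuzzy
                    {"match": {field: {"query": text,
                                       "fuzziness": 2,
                                       "prefix_length": 2}}},
                    # any word, exact
                    {"match": {field: {"query": text}}},
                    # all words, exact
                    {"match": {field: {"query": text, "operator": "and"}}},
                    # strict: exact phrase
                    {"match_phrase": {field: {"query": text}}},
                ]
            }
        },
        "size": size,
    }


body = layered_address_query("d61Address", "7 LONDON CIRCUIT CITY ACT 2601")
print(json.dumps(body, indent=2))
```

How the phrase clause behaves against a shingled field depends on the analyzer configuration, so it may need a differently analyzed sub-field in practice.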

neil.bacon · August 23, 2016, 11:33pm

Thanks for the ideas. A lot can go wrong when you attempt to segment an address of unknown format (e.g. building names, unit/flats, levels/floors, number ranges, prefixes, suffixes, abbreviations, aliases for streets and localities, postcode placement). In general we don't know the error characteristics of the input (which fields are most likely to contain errors). I've found that using bigrams/shingles and fuzzy matching actually handles these issues pretty well (I evaluate each change against many variations of addresses).

I'm currently testing changes to ClassicSimilarity.tf() and idf() (by subclassing) to address the original "not natural language" issue. I'm doing this in raw Lucene for now. Can anyone advise on how best to add a new custom Similarity class to ES? Could it be done with a Groovy script? Thanks for all your help.
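
For readers unfamiliar with what tf() and idf() contribute, the following Python sketch (not Lucene code) mirrors the default formulas as I understand them for the Lucene 6 line, sqrt(freq) and 1 + ln((docCount + 1) / (docFreq + 1)), alongside one possible flattened variant. The flattened functions and the document counts are assumptions for illustration; the post does not say what the overrides actually return.

```python
import math


def classic_tf(freq: float) -> float:
    """Default-style term frequency factor: square root of the count."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Default-style inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


def flat_tf(freq: float) -> float:
    """Assumed override: a term either matches or it does not."""
    return 1.0 if freq > 0 else 0.0


def flat_idf(doc_freq: int, doc_count: int) -> float:
    """Assumed override: every term is worth the same, rare or common."""
    return 1.0


# Made-up counts: SMYTH is rare, SMITH is common in a 1,000,000-doc index.
doc_count = 1_000_000
for name, doc_freq in (("SMYTH", 10), ("SMITH", 50_000)):
    print(name, classic_idf(doc_freq, doc_count), flat_idf(doc_freq, doc_count))
```

With the default-style idf the rare term gets several times the weight of the common one, which is what lets a fuzzy match on SMYTH outrank an exact match on SMITH; the flattened variant removes that difference.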
