Non-standard calculation of field length for norms

beowulfenator · September 23, 2015, 1:14pm

I am searching for locations and their synonyms. For example, "San Francisco" is really ["San Francisco", "Frisco", "SF"].

My problem is that for this array the field length is 4 tokens. So matching "San Francisco" against the array is worth less than matching it against "South San Francisco", because that field has only 3 tokens.
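To make the arithmetic concrete, here is a rough sketch. It assumes Lucene's classic TF/IDF similarity, where the field-length norm is 1 / sqrt(number of tokens). The helper name `length_norm` is made up for illustration, and the stored norm is compressed into a single byte, so real values are coarser than what this prints.

```python
import math

def length_norm(num_tokens):
    """Field-length norm of the classic TF/IDF similarity: 1 / sqrt(token count)."""
    return 1.0 / math.sqrt(num_tokens)

# Token counts for the two competing fields, splitting on whitespace.
synonym_array = ["San Francisco", "Frisco", "SF"]
array_tokens = sum(len(value.split()) for value in synonym_array)
other_tokens = len("South San Francisco".split())

# The shorter field gets the larger norm, so it scores higher for the same match.
print(array_tokens, round(length_norm(array_tokens), 3))
print(other_tokens, round(length_norm(other_tokens), 3))
```

Under that assumption the 4-token array is penalized more than the 3-token "South San Francisco" field, which is the ranking problem described above.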

I tried disabling field norms altogether, and that works a bit better, but it would be great to somehow configure Elasticsearch to use only the first element of the array when calculating field length for normalization. In other words, I want my array to count as 2 tokens long and "South San Francisco" as 3 tokens long.

Is that possible?

Also, if field-length norms are disabled, is it possible to somehow reward matching all tokens in a field, so that matching "York" to "York" would be worth more than matching "York" to "New York"?

softwaredoug · September 23, 2015, 3:20pm

If those other elements are synonyms, you may want to put them in a synonym filter instead of using a multivalue field. Instead of being additional values, they'll overlap the original token. This way, norms can be computed and you can discount the overlaps when dealing with norms.
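As a sketch of the "discount the overlaps" part: the TF/IDF similarity has a `discount_overlaps` option, on by default, which leaves tokens with a position increment of 0 out of the norm calculation. The fragment below shows how it might be set explicitly in an index-creation body. The similarity name `city-similarity`, the type name `city`, and the field name `name` are hypothetical. The similarity type `default` and the field type `string` are assumptions based on Elasticsearch 1.x/2.x, and may be named differently in other versions.

```python
# Index-creation body that sets discount_overlaps explicitly (illustrative only).
similarity_settings = {
    'settings': {
        'index': {
            'similarity': {
                'city-similarity': {            # hypothetical similarity name
                    'type': 'default',          # TF/IDF similarity (assumed ES 1.x/2.x name)
                    'discount_overlaps': True   # ignore tokens with position increment 0
                }
            }
        }
    },
    'mappings': {
        'city': {                               # hypothetical document type
            'properties': {
                'name': {                       # hypothetical field
                    'type': 'string',
                    'similarity': 'city-similarity'
                }
            }
        }
    }
}
```

Since the option is on by default, spelling it out mainly documents the intent.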

beowulfenator · September 23, 2015, 4:28pm

I'm not sure a synonym filter will handle this case. Let's say I have "New York" as my field value and it's broken down into tokens "new" and "york". How can I synonymize this with "NY"?

softwaredoug · September 24, 2015, 3:49pm

The synonym filter tries to expand the tokens into the token stream at the appropriate position.

For example, I created the following (using the Python API):

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import NotFoundError


settings = {
    'settings': {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        'analysis': {
            "analyzer": {
                "city-syn-analyzer": {
                    "type": 'custom',
                    'tokenizer': 'whitespace',
                    'filter': ['city-synonym']
                }
            },
            "filter" : {
                "city-synonym" : {
                    "type" : "synonym",
                    "synonyms" : [
                        "new york, NY, NYC, New York City",
                    ]
                }
            }
        }
    }
}

if __name__ == "__main__":
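    from sys import argv
    # argv[1] is the Elasticsearch URL, e.g. http://localhost:9200
    es = Elasticsearch(argv[1])
    # Drop any leftover index so the analysis settings are applied fresh.
    try:
        es.indices.delete(index='test')
    except NotFoundError:
        pass
    es.indices.create(index='test', body=settings)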

When I test this with elyzer (our easier-to-read analyzer debugger), I get:

$ elyzer --es http://localhost:9200 --index test --analyzer city-syn-analyzer --text "New York City"
TOKENIZER: whitespace
{1:New}	{2:York}	{3:City}	
TOKEN_FILTER: city-synonym
{1:new,NY,NYC,New}	{2:york,York}	{3:City}

The first number is the token position. Notice how "new", "NY", "NYC" and "New" overlap at position 1? (I'm not lowercasing, which is silly.) By discounting overlaps, the default similarity counts this as a document of length 3. Effectively, the field gets the length of the longest synonym phrase (New York City).
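The length calculation can be modelled in a few lines. This is a simplified sketch, not Lucene's actual code: it takes the (position, token) pairs from the elyzer output above and, when overlaps are discounted, counts each position once. The function name `field_length` is hypothetical.

```python
# (position, token) pairs as printed by elyzer after the city-synonym filter.
tokens = [
    (1, 'new'), (1, 'NY'), (1, 'NYC'), (1, 'New'),
    (2, 'york'), (2, 'York'),
    (3, 'City'),
]

def field_length(tokens, discount_overlaps=True):
    """Count tokens toward field length; stacked tokens count once when overlaps are discounted."""
    if discount_overlaps:
        # Tokens sharing a position have a position increment of 0 and are skipped.
        return len({position for position, _ in tokens})
    return len(tokens)

print(field_length(tokens))                           # distinct positions only
print(field_length(tokens, discount_overlaps=False))  # every emitted token
```

With overlaps discounted the field counts as 3 tokens. Without discounting, all 7 emitted tokens would count and the norm would be much lower.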

beowulfenator · September 24, 2015, 7:20pm

Thanks for the idea! Too bad I already added more documents to the index, one for every city synonym. Actually, this is even better, because when matching NY to NY, the score is higher than when matching NY to the list of synonyms that has a length of 3.