Shingles which respect punctuation

mgvarley · June 16, 2015, 10:28am

I am building an address matching engine for UK addresses in Elasticsearch and have found shingles to be very useful; however, I am seeing some issues when it comes to punctuation. A query for "4 Walmley Close" returns the following matches:

1) Units 3 And 4, Walmley Chambers, 3 Walmley Close
2) Flat 4, Walmley Court, 10 Walmley Close
3) Co-Operative Retail Services Ltd, 4 Walmley Close

The true match is number 3; however, both 1 and 2 match (falsely) because each of them produces '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for address 1, I currently get:

units 3
3 and
and 4
4 walmley
walmley chambers
chambers 3
3 walmley
walmley close

...when in fact all I want is:

units 3
3 and
and 4
walmley chambers
3 walmley
walmley close
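
To make the target output concrete, here is a minimal sketch in plain Python of the behaviour I am after. It is not an Elasticsearch analyzer, only an illustration: the function name `comma_aware_shingles` is hypothetical, and the tokenization (lowercase, keep runs of letters and digits) is a simplifying assumption rather than the standard tokenizer.

```python
import re


def comma_aware_shingles(address, size=2):
    """Return word shingles of `size` words that never straddle a comma."""
    shingles = []
    # Treat each comma-separated part of the address as its own unit.
    for segment in address.split(","):
        # Simplified tokenization (an assumption): lowercase, letters/digits only.
        tokens = re.findall(r"[a-z0-9]+", segment.lower())
        # Slide a window of `size` tokens across this segment only.
        for i in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[i:i + size]))
    return shingles


print(comma_aware_shingles("Units 3 And 4, Walmley Chambers, 3 Walmley Close"))
```

With shingles built this way, '4 walmley' is produced only for address 3, where "4" and "Walmley" sit in the same comma-separated part.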

My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace. This helps in that it retains the commas and would potentially avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2); however, I end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.

As you can see in the index settings, I also have a street_sym filter which I would love to be able to use in my shingles. For this example, in addition to generating 'walmley close' I would like to have 'walmley cl'; however, when I attempted to include the filter I got shingles of 'close cl', which were not terribly helpful!
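
Again as a plain Python illustration of the output I want (not of how the Elasticsearch synonym and shingle filters interact), the sketch below takes finished shingles and adds an abbreviated variant for each street word. The name `with_street_variants` is hypothetical, and the mapping is only a subset of the street_sym rules, inverted so that the full form points to the abbreviation.

```python
# Subset of the street_sym rules from the settings, as abbreviation -> full form.
STREET_SYM = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
    "ct": "court",
    "ln": "lane",
    "cl": "close",
    "dr": "drive",
}
# Inverted view (an assumption for this sketch): full form -> abbreviation.
FULL_TO_ABBR = {full: abbr for abbr, full in STREET_SYM.items()}


def with_street_variants(shingles):
    """Return each shingle plus a copy with a street word abbreviated."""
    out = []
    for shingle in shingles:
        out.append(shingle)
        tokens = shingle.split()
        # Replace one street word at a time, so a word is never paired
        # with its own synonym (which is how 'close cl' would appear).
        for i, token in enumerate(tokens):
            abbr = FULL_TO_ABBR.get(token)
            if abbr is not None:
                out.append(" ".join(tokens[:i] + [abbr] + tokens[i + 1:]))
    return out


print(with_street_variants(["3 walmley", "walmley close"]))
```

The point of the sketch is that the synonym is substituted inside an existing shingle rather than being shingled alongside the original token.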

Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.

Thanks in advance for any help offered.

"analysis": {
    "filter": {
        "shingle": {
            "type": "shingle",
            "output_unigrams": false
        },
        "street_sym": {
            "type": "synonym",
            "synonyms": [
                "st => street",
                "rd => road",
                "ave => avenue",
                "ct => court",
                "ln => lane",
                "terr => terrace",
                "cir => circle",
                "hwy => highway",
                "pkwy => parkway",
                "cl => close",
                "blvd => boulevard",
                "dr => drive",
                "ste => suite",
                "wy => way",
                "tr => trail"
            ]
        }
    },
    "analyzer": {
        "shingle": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "shingle"
            ]
        }
    }
}

Eduard_Dudar · March 16, 2016, 5:20am

I'm looking for a solution to a similar problem. With raw Lucene access, people split paragraphs manually, but with Elasticsearch that's not really possible.
