I am building an address matching engine for UK addresses in Elasticsearch and have found shingles to be very useful. However, I am seeing some issues when it comes to punctuation. A query for "4 Walmley Close" returns the following matches:
1. Units 3 And 4, Walmley Chambers, 3 Walmley Close
2. Flat 4, Walmley Court, 10 Walmley Close
3. Co-Operative Retail Services Ltd, 4 Walmley Close
The true match is number 3; however, both 1 and 2 match (falsely) because each of them produces the shingle '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for example 1, I currently get:
- units 3
- 3 and
- and 4
- 4 walmley
- walmley chambers
- chambers 3
- 3 walmley
- walmley close
...when in fact all I want is:
- units 3
- 3 and
- and 4
- walmley chambers
- 3 walmley
- walmley close
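To make the difference concrete, the two lists above can be reproduced with a rough Python sketch. This is not Elasticsearch code: the regex tokenizer is only an approximation of the standard tokenizer plus the lowercase filter, and the function names are made up for illustration.

```python
import re

def tokenize(text):
    # Rough stand-in for "standard" tokenizer + "lowercase" filter:
    # keep runs of letters and digits, drop all punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def bigrams(tokens):
    # Two-word shingles with no unigrams (like output_unigrams: false).
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def shingles_across_commas(address):
    # Current behaviour: commas are discarded before shingling,
    # so shingles can straddle them.
    return bigrams(tokenize(address))

def shingles_within_commas(address):
    # Desired behaviour: shingle each comma-separated part on its own.
    result = []
    for segment in address.split(","):
        result.extend(bigrams(tokenize(segment)))
    return result

address = "Units 3 And 4, Walmley Chambers, 3 Walmley Close"
print(shingles_across_commas(address))  # includes '4 walmley' and 'chambers 3'
print(shingles_within_commas(address))  # omits both of those
```

The second function is the behaviour I am trying to get out of the analyzer; how to express it in the index settings is the part I cannot work out.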
My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace. This helps in that it retains the commas and would potentially avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2). However, I end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.
As you can see in the index settings, I also have a street_sym filter which I would love to be able to use in my shingles. For this example, in addition to generating 'walmley close' I would like to have 'walmley cl'. However, when I attempted to include this I got shingles of 'close cl', which were not terribly helpful!
Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.
Thanks in advance for any help offered.
"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    }
  },
  "analyzer": {
    "shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "shingle"
      ]
    }
  }
}