Shingles which respect punctuation


(Mark Varley) #1

I am building an address matching engine for UK addresses in Elasticsearch and have found shingles to be very useful. However, I am seeing some issues when it comes to punctuation. A query for "4 Walmley Close" returns the following matches:

  1. Units 3 And 4, Walmley Chambers, 3 Walmley Close
  2. Flat 4, Walmley Court, 10 Walmley Close
  3. Co-Operative Retail Services Ltd, 4 Walmley Close

The true match is number 3. However, 1 and 2 also match (falsely), because both produce the shingle '4 walmley'. I would like to tell the shingle analyzer not to generate shingles that straddle commas. For example, for address 1 I currently get:

  • units 3
  • 3 and
  • and 4
  • 4 walmley
  • walmley chambers
  • chambers 3
  • 3 walmley
  • walmley close

...when in fact all I want is:

  • units 3
  • 3 and
  • and 4
  • walmley chambers
  • 3 walmley
  • walmley close
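To make the desired behaviour concrete, here is a minimal Python sketch of comma-aware shingling done outside Elasticsearch. It is not an Elasticsearch analyzer, only an illustration of the output wanted: the function name `comma_aware_shingles` is hypothetical, and the `\w+` pattern is a rough stand-in for the standard tokenizer plus lowercase filter.

```python
import re


def comma_aware_shingles(text, size=2):
    """Return word shingles of the given size that never span a comma."""
    shingles = []
    # Treat each comma-separated segment as an independent token stream,
    # so no shingle can be built from tokens on both sides of a comma.
    for segment in text.lower().split(","):
        # Assumption: \w+ roughly approximates the standard tokenizer.
        tokens = re.findall(r"\w+", segment)
        for i in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[i:i + size]))
    return shingles


for address in (
    "Units 3 And 4, Walmley Chambers, 3 Walmley Close",
    "Flat 4, Walmley Court, 10 Walmley Close",
    "Co-Operative Retail Services Ltd, 4 Walmley Close",
):
    print(comma_aware_shingles(address))
```

With this splitting, only the third address yields '4 walmley', which is the behaviour the query "4 Walmley Close" needs.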

My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace. This helps in that it retains the commas and would potentially avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2). However, I end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.

As you can see in the index settings, I also have a street_sym filter which I would love to be able to use in my shingles. For this example, in addition to generating 'walmley close' I would like to have 'walmley cl'. However, when I attempted to include it I got shingles of 'close cl', which were not terribly helpful!
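As an illustration of the synonym behaviour described here, the following Python sketch expands a finished shingle into abbreviated variants instead of shingling a stream that already contains both forms. It runs outside Elasticsearch; `ABBREVIATIONS` and `shingle_variants` are hypothetical names, and the mapping is a small inverted subset of the street_sym list, assumed for the example.

```python
from itertools import product

# Assumption: a small subset of the street_sym mappings, inverted so that
# the full word points at its abbreviation.
ABBREVIATIONS = {"close": "cl", "road": "rd", "street": "st"}


def shingle_variants(shingle):
    """Return the shingle plus every variant with street words abbreviated."""
    options = []
    for token in shingle.split():
        choices = [token]
        if token in ABBREVIATIONS:
            # The abbreviation replaces the token in a variant; it is never
            # placed next to it, so 'close cl' cannot be produced.
            choices.append(ABBREVIATIONS[token])
        options.append(choices)
    return [" ".join(combo) for combo in product(*options)]


print(shingle_variants("walmley close"))
```

Because the substitution happens per token position, the word and its abbreviation are alternatives within a shingle rather than neighbours in the token stream.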

Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.

Thanks in advance for any help offered.

"analysis": {
    "filter": {
        "shingle": {
            "type": "shingle",
            "output_unigrams": false
        },
        "street_sym": {
            "type": "synonym",
            "synonyms": [
                "st => street",
                "rd => road",
                "ave => avenue",
                "ct => court",
                "ln => lane",
                "terr => terrace",
                "cir => circle",
                "hwy => highway",
                "pkwy => parkway",
                "cl => close",
                "blvd => boulevard",
                "dr => drive",
                "ste => suite",
                "wy => way",
                "tr => trail"
            ]
        }
    },
    "analyzer": {
        "shingle": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "shingle"
            ]
        }
    }
}

(Eduard Dudar) #2

I'm looking for a solution to a similar problem. With raw Lucene access, people split paragraphs manually, but with Elasticsearch that is not quite possible.

