Synonym mapping for classification lookup / classifying documents


(Alex Smirnov) #1

We need to classify every document before recording it in our DB. We chose Elasticsearch's synonym functionality to do this, building a dictionary of keywords associated with each classification.

The idea is to get a high-probability classification. For example:

girl, female, woman, women, wife => women
lipstick, mascara => women, beauty
flats => women, shoes
accessory => electronics, women, men, kids

....
etc.

The question is how to set this up so that multi-word terms also match in their plural form, e.g.:

apple accessory => electronics, gadgets

    http://localhost:9200/classifier/_analyze?analyzer=synonymclassification&text=apple accessory

The example above returns the correct result, with electronics and gadgets as the 2 matched tokens. However, a similar lookup for

    http://localhost:9200/classifier/_analyze?analyzer=synonymclassification&text=apple accessories

returns apple and accessory as 2 tokens and no longer matches the configured synonym rules.

Since we need the highest possible classification accuracy, we use kstem as one of the filters in our custom analyzer, together with the following:

"analysis": {
                            "analyzer": {
                              "synonymclassification": {
                                "type": "custom",
                                "char_filter": "html_strip",
                                "filter": [
                                  "apostrophe",
                                  "standard",
                                  "elision",
                                  "asciifolding",
                                  "lowercase",
                                  "stop",
                                  "length",
                                  "my_stemmer",
                                  "synonym"
                                ],
                                "tokenizer": "standard"
                              },
...
                            "keywords": {
                              "type": "string",
                              "analyzer": "language",
                              "fields": {
                                "my_synonym": {
                                  "type": "string",
                                  "analyzer": "synonymclassification"
                                },

  1. Does this appear to be a good approach to the problem?

  2. Do we need to add or drop some of the filters to get the desired behavior, or do we instead need to list every variation of each search phrase (e.g. "search phrases, search phrase") in our synonym dictionary?

  3. Is it too dangerous (resource intensive) to use http://localhost:9200/classifier/_analyze?... for these lookups at a very high rate, anywhere from 100 to 500 requests per second, during the initial data discovery and classification?

Thank you for any help and pointers ...


(Loren Siebert) #2

Make sure my_stemmer is treating accessories the way you are expecting. Create a new custom analyzer that only handles kstem so you can isolate the behavior. My hunch is that it is not stemming it to accessory, and so your synonym filter is not kicking in.


(Alex Smirnov) #3

I tried it as

  "my_stemmer": {
        "type": "stemmer",
        "name": "light_english"
    },

and also:

   "my_stemmer": {
        "type": "kstem",
        "name": "light_english"
    },

Both without full success so far. Any more suggestions?

If I put the stemmer after the synonym filter in the analyzer, the results are even less manageable: you can't predict what a word will stem to, and building our own dictionary becomes very complicated.


(Loren Siebert) #4

Those analyzers do not look right to me. Here's a sample session for you to run in Sense:


DELETE /test_index
# testing analyzers
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "light_analyzer": {
          "filter": "light_filter",
          "type": "custom",
          "tokenizer": "standard"
        },
        "minimal_analyzer": {
          "filter": "minimal_filter",
          "type": "custom",
          "tokenizer": "standard"
        },
        "snowball_analyzer": {
          "filter": "snowball_filter",
          "type": "custom",
          "tokenizer": "standard"
        }
      },
      "filter": {
        "light_filter": {
          "type": "stemmer",
          "name": "light_english"
        },
        "minimal_filter": {
          "type": "stemmer",
          "name": "minimal_english"
        },
        "snowball_filter": {
          "type": "snowball",
          "language": "English"
        }
      }
    }
  }
}

######################################################
# Proof that the analyzer is working

GET /test_index/_analyze?analyzer=light_analyzer&text=accessories+accessory
GET /test_index/_analyze?analyzer=minimal_analyzer&text=accessories+accessory
# note how snowball stems to "accessori"
GET /test_index/_analyze?analyzer=snowball_analyzer&text=accessories+accessory
# a shortcut to get the built-in snowball analyzer
GET /test_index/_analyze?analyzer=snowball&text=accessories+accessory
# standard analyzer doesn't touch the words
GET /test_index/_analyze?analyzer=standard&text=accessories+accessory
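
To script the same comparison outside Sense, here is a minimal Python sketch. It assumes the query-string form of _analyze used in the session above (newer Elasticsearch versions may expect a JSON body instead). The helper name analyze_url is made up for illustration, and it only builds the URLs; it does not send any request.

```python
from urllib.parse import urlencode


def analyze_url(base_url: str, index: str, analyzer: str, text: str) -> str:
    """Build a query-string _analyze URL like the ones in the session above.

    Hypothetical helper: it only formats the URL, nothing is sent.
    """
    # urlencode escapes spaces as '+', giving text=accessories+accessory.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url}/{index}/_analyze?{query}"


# Analyzer names taken from the test_index settings above.
ANALYZERS = ["light_analyzer", "minimal_analyzer", "snowball_analyzer", "standard"]

if __name__ == "__main__":
    for name in ANALYZERS:
        print(analyze_url("http://localhost:9200", "test_index", name,
                          "accessories accessory"))
```

Encoding the text this way also avoids the raw space in a URL such as text=apple accessory.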

Some other resources:
My talk at Elastic{ON} has a few slides on pitfalls/gotchas with synonyms, stemming, and how to have both in an analysis chain.
The repo we call Punchcard has this test harness analysis chain that contains everything except the synonym filter at the end. We use the Inquisitor plugin to test out synonym candidates using those test harness analyzers. If "accessory" and "accessories" did not emerge from that chain as the same token, we might add that pair to our synonyms list.

As I also mention in my talk, choosing a stemmer is making a choice in how you want to be wrong, so try them all out with a representative sample of your corpus to see what behavior is the most reasonable for you.

As for performance, the Elasticsearch Definitive Guide mentions that dictionary based stemmers (e.g., kstem) are 4-5x slower than algorithmic stemmers, so keep that in mind for your performance tuning.


(Alex Smirnov) #5

The problem is feeding in the stemmed words, and that is half the struggle.
Ideally I'd want to use regular words (in either singular or plural form, not both) that resolve to a synonym or synonyms.
But if I use the stemmer before the synonym filter in the analyzer chain, i.e.:

                            "my_synonym": {
                                "type": "custom",
                                "char_filter": "html_strip",
                                "filter": [
                                  "apostrophe",
                                  "standard",
                                  "elision",
                                  "asciifolding",
                                  "lowercase",
                                  "stop",
                                  "length",
                                  "minimal_filter",
                                  "synonym"
                                ],

then words like:

clothes, pants => clothing

will not resolve to anything, because the stemmer converts the word clothes to clothe. I wasn't able to figure out an easy way to build this type of synonym dictionary, since every word is stemmed and the same exceptions are needed.

E.g.:
flats => shoes for women
but with stemming it turns into flat, and I cannot logically map the word flat to women's shoes.

Any further suggestions?


(Loren Siebert) #6

We maintain a YAML file containing both the stemmed and the unstemmed version of the words. Look at how we handle business/businesses here:

The actual ES synonym file we generate from this YAML contains this entry:

business, businesse

If you have a synonym filter after your stemmer, you will need to put the stemmed form in the synonyms file. Otherwise you have to list out all the possible word forms.
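
A minimal Python sketch of that idea: keep the rules in their readable form and generate the stemmed left-hand sides from them. Everything here is an assumption for illustration. toy_stem is a crude plural stripper standing in for whatever stemmer filter you actually use (the real stemmed forms should be checked with _analyze), and stem_rule is a made-up helper, not how the YAML tooling mentioned above works.

```python
def toy_stem(word: str) -> str:
    """Crude plural stripper; a stand-in for the real stemmer filter."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"      # accessories -> accessory
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]            # flats -> flat, clothes -> clothe
    return word


def stem_rule(rule: str, stem=toy_stem) -> str:
    """Stem only the left-hand side of an 'a, b => x, y' synonym rule."""
    lhs, rhs = (side.strip() for side in rule.split("=>"))
    stemmed_terms = []
    for term in lhs.split(","):
        # Stem every word of a (possibly multi-word) term.
        stemmed = " ".join(stem(w) for w in term.split())
        if stemmed and stemmed not in stemmed_terms:
            stemmed_terms.append(stemmed)
    return f"{', '.join(stemmed_terms)} => {rhs}"


if __name__ == "__main__":
    for rule in ["clothes, pants => clothing", "flats => women, shoes"]:
        print(stem_rule(rule))
```

The right-hand side is left untouched, so the dictionary you maintain can still say flats => women, shoes while the generated file carries the stemmed form flat.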


(Alex Smirnov) #7

I was trying to figure this one out yesterday but didn't realize that it auto-creates stemmed versions of the words.
Do you have a way of reloading the file after the initial index load, or do you have to close/open the index or restart Elasticsearch every time you change the synonym file?

How do you handle multi-word keywords / phrases?

Thank you


(Loren Siebert) #8

Yes, I monitor that repo and push out the synonym files to the various codebases that use it, and then reindex the indexes that rely on those synonyms using index aliases so we don't have downtime.
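
A minimal sketch of the alias switch at the end of such a reindex, written as a Python helper that builds the request body for POST /_aliases. The alias and index names (classifier, classifier_v1, classifier_v2) and the helper name are hypothetical; the reindexing itself is not shown.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that repoints `alias` in a single request."""
    return {
        "actions": [
            # Both actions go in one call, so searches against the alias
            # never see a moment where it points at no index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: the new index was built with the updated synonyms.
    body = alias_swap_body("classifier", "classifier_v1", "classifier_v2")
    print(json.dumps(body, indent=2))
```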

We've punted on multi-word synonyms for now, but we're looking at doing something like this:
"womens pumps, ladies footwear" => "$FEMALE_SHOE$"

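To illustrate what such a rule would do, here is a rough Python sketch that collapses known multi-word phrases into a single placeholder token. This is only a simulation outside Elasticsearch, under the assumption of plain whitespace tokenization and lowercasing; collapse_phrases is a made-up name and not part of any library.

```python
from typing import Dict, List


def collapse_phrases(text: str, phrase_tokens: Dict[str, str]) -> List[str]:
    """Replace known multi-word phrases with a single placeholder token."""
    words = text.lower().split()
    # Index rules by their word tuple, skipping empty phrases.
    rules = {tuple(p.lower().split()): tok
             for p, tok in phrase_tokens.items() if p.strip()}
    # Try longer phrases first so the longest match wins.
    ordered = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(words):
        for phrase in ordered:
            if tuple(words[i:i + len(phrase)]) == phrase:
                out.append(rules[phrase])
                i += len(phrase)
                break
        else:
            # No phrase starts here; keep the word as-is.
            out.append(words[i])
            i += 1
    return out


if __name__ == "__main__":
    rules = {"womens pumps": "$FEMALE_SHOE$", "ladies footwear": "$FEMALE_SHOE$"}
    print(collapse_phrases("Black womens pumps size 8", rules))
```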
