Synonym Token Filter Error


#1

Synonym Token Filter gives an Error.

Elasticsearch version
Version: 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z, JVM: 1.8.0_161

Plugins installed:
Opennlp

JVM version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version
Linux server 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

When I try to reproduce the example from https://qbox.io/blog/synonym-token-filter-wordnet-applications I get the following error.

Error: 400 - failed to build synonyms
ES stack trace:

type: illegal_argument_exception
reason: failed to build synonyms

Steps to reproduce:

curl -XPUT 'localhost:9200/test_index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        }
      },
      "analyzer": {
        "wordnet-synonym-analyzer": {
          "tokenizer": "lowercase",
          "filter": [
            "synonym"
          ]
        }
      }
    }
  }
}'

EDIT:

If I change the tokenizer to "standard" everything is fine. Does anyone have an idea what to do?


(Abdon Pijpelink) #2

The problem is with line 40602 in your synonym file:

s(104178190,2,'78',n,2,0).

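Your lowercase tokenizer removes the number 78 completely, so this entry is left with no tokens at all and the synonym filter fails to build. The lowercase tokenizer is based on the letter tokenizer, which splits on every non-letter character and therefore discards digits.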
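
To make that behavior concrete, here is a rough Python stand-in for the lowercase tokenizer. It is only an approximation for illustration (a regular expression that keeps runs of letters), not Lucene's actual implementation, and the function name lowercase_tokenize is made up for this sketch:

```python
import re

# Approximation only: keep runs of letters, drop everything else
# (digits, punctuation, whitespace), then lowercase each run.
# This mimics the letter/lowercase tokenizer; it is not Lucene's code.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def lowercase_tokenize(text):
    """Return the lowercased letter-only tokens found in text."""
    return [m.group(0).lower() for m in _LETTER_RUN.finditer(text)]


print(lowercase_tokenize("Quick Brown-Fox"))  # ['quick', 'brown', 'fox']
print(lowercase_tokenize("78"))               # []
```

A term that comes back as an empty list is the situation the synonym filter rejects.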

There are a few options to solve this issue. Firstly, you could remove the synonyms that are pure numbers from your synonym file, like line 40602 in wn_s.pl.
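
If you go the first route, a small script can do the cleanup for you. This is a minimal sketch that assumes every synonym entry has the same shape as the line quoted above; the function names and the output file name are made up, and lines that do not match the expected shape are kept untouched:

```python
import re

# One WordNet Prolog synonym fact, e.g. s(104178190,2,'78',n,2,0).
# Assumption: every entry follows this shape; group 1 is the quoted term.
_FACT = re.compile(r"^s\(\d+,\d+,'(.*)',[a-z],\d+,\d+\)\.\s*$")


def has_letter(term):
    """True if a letter-based tokenizer would keep at least part of the term."""
    return any(ch.isalpha() for ch in term)


def filter_synonym_lines(lines):
    """Yield every line except synonym facts whose term contains no letters."""
    for line in lines:
        match = _FACT.match(line)
        if match and not has_letter(match.group(1)):
            continue  # e.g. '78' would be eliminated completely
        yield line


def clean_file(src, dst):
    """Copy src to dst, skipping the letter-free synonym facts."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(filter_synonym_lines(fin))
```

Calling clean_file("wn_s.pl", "wn_s.cleaned.pl") would write a filtered copy (the second file name is just an example) that you can then point synonyms_path at.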

Or, you could switch to a tokenizer that does not drop numbers. For example, the standard tokenizer in combination with the lowercase token filter (probably the best solution):

PUT /analyzers-blog-04-02
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        }
      },
      "analyzer": {
        "wordnet-synonym-analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym"
          ]
        }
      }
    }
  }
}

Or, if you really want to keep the existing analyzer, you can set "lenient": true in your synonym token filter definition, so that synonym rules that fail to parse are ignored instead of causing an error:

PUT /analyzers-blog-04-02
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl",
          "lenient": true
        }
      },
      "analyzer": {
        "wordnet-synonym-analyzer": {
          "tokenizer": "lowercase",
          "filter": [
            "synonym"
          ]
        }
      }
    }
  }
}