Synonym Token Filter Error

Jensxy · December 8, 2018, 7:57pm

The synonym token filter fails with an error when I create the index.

Elasticsearch version
Version: 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z, JVM: 1.8.0_161

Plugins installed:
Opennlp

JVM version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version
Linux server 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

When I try to reproduce the following: https://qbox.io/blog/synonym-token-filter-wordnet-applications

I get the following error.

Error: 400 - failed to build synonyms
ES stack trace:

type: illegal_argument_exception
reason: failed to build synonyms

Steps to reproduce:

curl -XPUT 'localhost:9200/test_index' -d '{
  "analysis": {
    "filter": {
      "synonym": {
        "type": "synonym",
        "format": "wordnet",
        "synonyms_path": "analysis/wn_s.pl"
      }
    },
    "analyzer": {
      "wordnet-synonym-analyzer": {
        "tokenizer": "lowercase",
        "filter": [
          "synonym"
        ]
      }
    }
  }
}'

EDIT:

If I change the tokenizer to "standard", everything works. Does anyone have an idea how to fix this?

abdon · December 9, 2018, 11:01am

The problem is with line 40602 in your synonym file:

s(104178190,2,'78',n,2,0).

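The number 78 is completely removed by your lowercase tokenizer. That tokenizer is based on the letter tokenizer, which splits text on every character that is not a letter and discards those characters. The synonym entry is therefore left with no tokens, and building the synonyms fails.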
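
To illustrate the behavior, here is a rough Python sketch that imitates what the lowercase tokenizer does to a piece of text. This is not Elasticsearch code: the function name lowercase_tokenize is made up for this example, and Python's str.isalpha only approximates Lucene's notion of a letter.

```python
from itertools import groupby


def lowercase_tokenize(text):
    """Approximate the lowercase tokenizer: keep runs of letters, lowercased."""
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter  # runs of non-letters (digits, spaces, punctuation) are dropped
    ]


print(lowercase_tokenize("Route 66 Motel"))  # ['route', 'motel']
print(lowercase_tokenize("78"))              # []
```

A synonym entry that consists only of digits ends up as an empty token list, which is the situation the error complains about.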

There are a few options to solve this issue. Firstly, you could remove the synonyms that are pure numbers from your synonym file, like line 40602 in wn_s.pl.
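
A minimal sketch of that first option, assuming every entry in wn_s.pl has the usual WordNet prolog shape s(synset_id,w_num,'word',ss_type,sense_number,tag_count). The function names and the output file name are hypothetical. It only drops entries whose word contains no letter at all, so entries that fail for other reasons would still need separate handling.

```python
import re

# One WordNet prolog entry: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
ENTRY = re.compile(r"^s\(\d+,\d+,'(.*)',[a-z],\d+,\d+\)\.$")


def keep_line(line):
    """Return False only for entries whose word contains no letter at all."""
    match = ENTRY.match(line.strip())
    if match is None:
        return True  # not a recognizable entry: leave it untouched
    return any(ch.isalpha() for ch in match.group(1))


def filter_file(src, dst):
    """Copy src to dst, skipping letter-free entries; return how many were skipped."""
    skipped = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if keep_line(line):
                fout.write(line)
            else:
                skipped += 1
    return skipped
```

You could then call filter_file("wn_s.pl", "wn_s_filtered.pl") and point synonyms_path at the filtered file.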

Or you could switch to an analyzer that does not drop numbers, for example the standard tokenizer in combination with the lowercase token filter (probably the best solution):

PUT /analyzers-blog-04-02
{
  "analysis": {
    "filter": {
      "synonym": {
        "type": "synonym",
        "format": "wordnet",
        "synonyms_path": "analysis/wn_s.pl"
      }
    },
    "analyzer": {
      "wordnet-synonym-analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym"
        ]
      }
    }
  }
}

Or, if you really want to use the existing analyzer, you can set "lenient": true in your synonym token filter definition, so that synonym rules that cannot be parsed are ignored instead of causing an error:

PUT /analyzers-blog-04-02
{
  "analysis": {
    "filter": {
      "synonym": {
        "type": "synonym",
        "format": "wordnet",
        "synonyms_path": "analysis/wn_s.pl",
        "lenient": true
      }
    },
    "analyzer": {
      "wordnet-synonym-analyzer": {
        "tokenizer": "lowercase",
        "filter": [
          "synonym"
        ]
      }
    }
  }
}
