Elasticsearch version
Version: 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z, JVM: 1.8.0_161
Plugins installed:
Opennlp
JVM version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
OS version
Linux server 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
The problem is with line 40602 in your synonym file:
s(104178190,2,'78',n,2,0).
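Your lowercase tokenizer removes the number 78 completely, so nothing is left of that entry. The lowercase tokenizer is based on the letter tokenizer, which breaks text on every character that is not a letter and therefore discards digits.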
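To make the effect concrete, here is a rough Python stand-in for what the tokenizer does to that entry. It is not Lucene code: the function name `lowercase_tokenize` is hypothetical, and the regular expression only approximates "runs of letters".

```python
import re

def lowercase_tokenize(text):
    """Rough approximation of the lowercase tokenizer (hypothetical helper):
    keep runs of letters, drop everything else, lowercase the result."""
    # [^\W\d_] is a word character that is neither a digit nor an underscore,
    # which is close to "any letter".
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

print(lowercase_tokenize("78"))             # []
print(lowercase_tokenize("Seventy-Eight"))  # ['seventy', 'eight']
```

A synonym consisting only of digits yields no tokens at all, while a spelled-out form survives as lowercased terms.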
There are a few options to solve this issue. Firstly, you could remove the synonyms that are pure numbers from your synonym file, like line 40602 in wn_s.pl.
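For the first option, a small script could filter the file before it is handed to Elasticsearch. The sketch below assumes every entry has the same shape as the line quoted above; the helper names are hypothetical, and the second sample entry is only illustrative.

```python
import re

# Matches entries shaped like: s(104178190,2,'78',n,2,0).
ENTRY = re.compile(r"^s\(\d+,\d+,'(?P<word>.*)',[a-z],\d+,\d+\)\.$")

def is_numeric_entry(line):
    """True if the line is a synonym entry whose word consists only of digits."""
    match = ENTRY.match(line.strip())
    return bool(match) and match.group("word").isdigit()

def drop_numeric_synonyms(lines):
    """Return the lines with all purely numeric synonym entries removed."""
    return [line for line in lines if not is_numeric_entry(line)]

sample = [
    "s(104178190,2,'78',n,2,0).",
    "s(100001740,1,'entity',n,1,11).",  # illustrative entry
]
print(drop_numeric_synonyms(sample))
# ["s(100001740,1,'entity',n,1,11)."]
```

To process the real file, read wn_s.pl line by line, pass the lines through `drop_numeric_synonyms`, and write the result to a new file that the synonym filter points to.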
Or, you could switch to an analyzer that does not drop numbers, for example a custom analyzer that uses the standard tokenizer in combination with the lowercase token filter (probably the best solution).
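As a sketch of the second option, the index settings could look roughly like the body built below. The filter name `wordnet_synonyms`, the analyzer name `synonym_analyzer`, and the path `analysis/wn_s.pl` are assumptions; adjust them to your setup. The script only prints the JSON body, which you would send yourself when creating the index.

```python
import json

# Index settings for a custom analyzer that keeps numbers:
# standard tokenizer + lowercase token filter + WordNet synonym filter.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "wordnet_synonyms": {            # assumed filter name
                    "type": "synonym",
                    "format": "wordnet",
                    "synonyms_path": "analysis/wn_s.pl",  # assumed location
                }
            },
            "analyzer": {
                "synonym_analyzer": {            # assumed analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "wordnet_synonyms"],
                }
            },
        }
    }
}

print(json.dumps(settings, indent=2))
```

The lowercase token filter comes before the synonym filter so that terms are already lowercased when synonyms are matched.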