Using synonym filter in es 6.0.0-rc1

mumpi · October 17, 2017, 12:39pm

Instead of migrating my es 5.6 indices to es 6, I wanted to recreate them. But somehow the behaviour of the synonym filter has changed.

The symptom is a message like "Invalid synonym rule at line 3" when creating the index, and I cannot see what is wrong with line 3:

launisch,launische,launischem,launischen,launischer,launisches
abend-make-up,abend-make-ups
ueberich,ueber-ich,ueber-ichs,ueberichs
ehrliebend,ehrliebende,ehrliebendem,ehrliebenden,ehrliebender,ehrliebendes

Of course it is not just this line; my synonym list is quite big, and I would have to fix it by trial and error.

So I would prefer to know what is different in 6.0. I have seen that the parameters "tokenizer" and "ignore_case" are deprecated, but I did not use them in 5.6 anyway.

Added later: I experimented a little more. I think I will drop the entries containing "-", because they are handled as two words after my tokenization. But then I ran into the next trap: it looks like the previous version did not bother (or complain) about inconsistencies, i.e. the same synonym appearing in different lines with a different first entry.
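
To check how an entry such as "abend-make-up" is split, the analyze API can be called with the tokenizer alone. This is a minimal sketch that assumes only the `standard` tokenizer from the analyzer below and leaves out the char filter and the token filters. The standard tokenizer breaks at the hyphens, so the entry reaches the synonym filter as several tokens rather than one.

```console
GET _analyze
{
  "tokenizer": "standard",
  "text": "abend-make-up"
}
```

Multi-word synonym entries are supported by the synonym filter in general, so a hyphenated entry on its own may not be what triggers the error.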

Added later: I found out that removing the stopword filter, which contains terms that are also present in the synonym list, lets the create-index call run without error. I find that behaviour strange; I thought the filters in a chain work independently on whatever is presented at their input.

Anybody with experiences in that area?

Any hint is appreciated very much.

Thanks, regards, Jürg

My analyzer, which uses the synonym filter, looks like this:

        "expand": {
            "type": "custom",
            "char_filter": [
                "komischeZeichen"
            ],
            "filter": [
                "lowercase",
                "morph"
            ],
            "tokenizer": "standard"
        }

My filter morph:

        "morph": {
            "expand": true,
            "type": "synonym",
            "synonyms": [
                "launisch,launische,launischem,launischen,launischer,launisches",
                "abend-make-up,abend-make-ups",
                "ueberich,ueber-ich,ueber-ichs,ueberichs",
                "ehrliebend,ehrliebende,ehrliebendem,ehrliebenden,ehrliebender,ehrliebendes"
            ]
        }

The char filter:

       "komischeZeichen": {
            "type": "mapping",
            "mappings": [
                "＇=>,", "'=>,", "´=>,", "`=>,", "’=>,", "Œ=>OE", "œ=>oe", "¡=>i", "À=>A", "Á=>A", "Â=>A", "Ã=>A", "Ä=>Ae", "Å=>A", "Æ=>AE", "Ç=>C", "È=>E", "É=>E", "Ê=>E", "Ë=>E", "Ì=>I", "Í=>I", "Î=>I", "Ï=>IIII", "Ð=>D", "Ñ=>N", "Ò=>O", "Ó=>O", "Ô=>O", "Õ=>O", "Ö=>Oe", "Ù=>U", "Ú=>U", "Û=>U", "Ü=>Ue", "Ý=>Y", "ß=>ss", "à=>a", "á=>a", "â=>a", "ã=>a", "ä=>ae", "å=>a", "æ=>ae", "ç=>c", "è=>e", "é=>e", "ê=>e", "ë=>e", "ì=>i", "í=>i", "î=>i", "ï=>iiii", "ð=>d", "ñ=>n", "ò=>o", "ó=>o", "ô=>o", "õ=>o", "ö=>oe", "ù=>u", "ú=>u", "û=>u", "ü=>ue", "ý=>y", "ÿ=>y", "Ā=>A", "ā=>a", "Ă=>A", "ă=>a", "Ą=>A", "ą=>a", "Ć=>C", "ć=>c", "Ĉ=>C", "ĉ=>c", "Ċ=>C", "ċ=>c", "Č=>C", "č=>c", "Ď=>D", "ď=>d", "Đ=>D", "đ=>d", "Ē=>E", "ē=>e", "Ĕ=>E", "ĕ=>e", "Ė=>E", "ė=>e", "Ę=>E", "ę=>e", "Ě=>E", "ě=>e", "Ĝ=>G", "ĝ=>g", "Ğ=>G", "ğ=>g", "Ġ=>G", "ġ=>g", "Ģ=>G", "ģ=>g", "Ĥ=>H", "ĥ=>h", "Ħ=>H", "ħ=>h", "Ĩ=>I", "ĩ=>i", "Ī=>I", "ī=>i", "Ĭ=>I", "ĭ=>i", "Į=>I", "į=>i", "İ=>I", "ı=>i", "Ĳ=>IJ", "ĳ=>ij", "Ĵ=>J", "ĵ=>j", "Ķ=>K", "ķ=>k", "ĸ=>K", "Ĺ=>L", "ĺ=>l", "Ļ=>L", "ļ=>l", "Ľ=>L", "ľ=>l", "Ŀ=>L", "ŀ=>l", "Ł=>L", "ł=>l", "Ń=>N", "ń=>n", "Ņ=>N", "ņ=>n", "Ň=>N", "ň=>n", "ŉ=>n", "Ŋ=>N", "ŋ=>n", "Ō=>O", "ō=>o", "Ŏ=>O", "ŏ=>o", "Ő=>O", "ő=>o", "Ŕ=>R", "ŕ=>r", "Ŗ=>R", "ŗ=>r", "Ř=>R", "ř=>r", "Ś=>S", "ś=>s", "Ŝ=>S", "ŝ=>s", "Ş=>S", "ş=>s", "Š=>S", "š=>s", "Ţ=>T", "ţ=>t", "Ť=>T", "ť=>t", "Ŧ=>T", "ŧ=>t", "Ũ=>U", "ũ=>u", "Ū=>U", "ū=>u", "Ŭ=>U", "ŭ=>u", "Ů=>U", "ů=>u", "Ű=>U", "ű=>u", "Ų=>U", "ų=>u", "Ŵ=>W", "ŵ=>w", "Ŷ=>Y", "ŷ=>y", "Ÿ=>Y", "Ź=>Z", "ź=>z", "Ż=>Z", "ż=>z", "Ž=>Z", "ž=>z", "Þ=>th", "Ø=>O", "þ=>Th", "ø=>o"
            ]
        }

spinscale · October 23, 2017, 9:13am

Hey,

can you please provide a fully reproducible example containing all the curl/console calls you made, so others can reproduce it?

I tried this and it worked:

DELETE foo

PUT foo
{
  "mappings": {
    "bar" : {
      "properties": {
        "field" : {
          "type": "text",
          "analyzer": "expand"
        }
      }
    }
  }, 
  "settings": {
    "analysis": {
      "analyzer": {
        "expand": {
          "type": "custom",
          "char_filter": [
            "komischeZeichen"
          ],
          "filter": [
            "lowercase",
            "morph"
          ],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "morph": {
          "expand": true,
          "type": "synonym",
          "synonyms": [
            "launisch,launische,launischem,launischen,launischer,launisches",
            "abend-make-up,abend-make-ups",
            "ueberich,ueber-ich,ueber-ichs,ueberichs",
            "ehrliebend,ehrliebende,ehrliebendem,ehrliebenden,ehrliebender,ehrliebendes"
          ]
        }
      },
      "char_filter": {
        "komischeZeichen": {
          "type": "mapping",
          "mappings": [
            "＇=>,",
            "'=>,",
            "´=>,",
            "`=>,",
            "’=>,",
            "Œ=>OE",
            "œ=>oe",
            "¡=>i",
            "À=>A",
            "Á=>A",
            "Â=>A",
            "Ã=>A",
            "Ä=>Ae",
            "Å=>A",
            "Æ=>AE",
            "Ç=>C",
            "È=>E",
            "É=>E",
            "Ê=>E",
            "Ë=>E",
            "Ì=>I",
            "Í=>I",
            "Î=>I",
            "Ï=>IIII",
            "Ð=>D",
            "Ñ=>N",
            "Ò=>O",
            "Ó=>O",
            "Ô=>O",
            "Õ=>O",
            "Ö=>Oe",
            "Ù=>U",
            "Ú=>U",
            "Û=>U",
            "Ü=>Ue",
            "Ý=>Y",
            "ß=>ss",
            "à=>a",
            "á=>a",
            "â=>a",
            "ã=>a",
            "ä=>ae",
            "å=>a",
            "æ=>ae",
            "ç=>c",
            "è=>e",
            "é=>e",
            "ê=>e",
            "ë=>e",
            "ì=>i",
            "í=>i",
            "î=>i",
            "ï=>iiii",
            "ð=>d",
            "ñ=>n",
            "ò=>o",
            "ó=>o",
            "ô=>o",
            "õ=>o",
            "ö=>oe",
            "ù=>u",
            "ú=>u",
            "û=>u",
            "ü=>ue",
            "ý=>y",
            "ÿ=>y",
            "Ā=>A",
            "ā=>a",
            "Ă=>A",
            "ă=>a",
            "Ą=>A",
            "ą=>a",
            "Ć=>C",
            "ć=>c",
            "Ĉ=>C",
            "ĉ=>c",
            "Ċ=>C",
            "ċ=>c",
            "Č=>C",
            "č=>c",
            "Ď=>D",
            "ď=>d",
            "Đ=>D",
            "đ=>d",
            "Ē=>E",
            "ē=>e",
            "Ĕ=>E",
            "ĕ=>e",
            "Ė=>E",
            "ė=>e",
            "Ę=>E",
            "ę=>e",
            "Ě=>E",
            "ě=>e",
            "Ĝ=>G",
            "ĝ=>g",
            "Ğ=>G",
            "ğ=>g",
            "Ġ=>G",
            "ġ=>g",
            "Ģ=>G",
            "ģ=>g",
            "Ĥ=>H",
            "ĥ=>h",
            "Ħ=>H",
            "ħ=>h",
            "Ĩ=>I",
            "ĩ=>i",
            "Ī=>I",
            "ī=>i",
            "Ĭ=>I",
            "ĭ=>i",
            "Į=>I",
            "į=>i",
            "İ=>I",
            "ı=>i",
            "Ĳ=>IJ",
            "ĳ=>ij",
            "Ĵ=>J",
            "ĵ=>j",
            "Ķ=>K",
            "ķ=>k",
            "ĸ=>K",
            "Ĺ=>L",
            "ĺ=>l",
            "Ļ=>L",
            "ļ=>l",
            "Ľ=>L",
            "ľ=>l",
            "Ŀ=>L",
            "ŀ=>l",
            "Ł=>L",
            "ł=>l",
            "Ń=>N",
            "ń=>n",
            "Ņ=>N",
            "ņ=>n",
            "Ň=>N",
            "ň=>n",
            "ŉ=>n",
            "Ŋ=>N",
            "ŋ=>n",
            "Ō=>O",
            "ō=>o",
            "Ŏ=>O",
            "ŏ=>o",
            "Ő=>O",
            "ő=>o",
            "Ŕ=>R",
            "ŕ=>r",
            "Ŗ=>R",
            "ŗ=>r",
            "Ř=>R",
            "ř=>r",
            "Ś=>S",
            "ś=>s",
            "Ŝ=>S",
            "ŝ=>s",
            "Ş=>S",
            "ş=>s",
            "Š=>S",
            "š=>s",
            "Ţ=>T",
            "ţ=>t",
            "Ť=>T",
            "ť=>t",
            "Ŧ=>T",
            "ŧ=>t",
            "Ũ=>U",
            "ũ=>u",
            "Ū=>U",
            "ū=>u",
            "Ŭ=>U",
            "ŭ=>u",
            "Ů=>U",
            "ů=>u",
            "Ű=>U",
            "ű=>u",
            "Ų=>U",
            "ų=>u",
            "Ŵ=>W",
            "ŵ=>w",
            "Ŷ=>Y",
            "ŷ=>y",
            "Ÿ=>Y",
            "Ź=>Z",
            "ź=>z",
            "Ż=>Z",
            "ż=>z",
            "Ž=>Z",
            "ž=>z",
            "Þ=>th",
            "Ø=>O",
            "þ=>Th",
            "ø=>o"
          ]
        }
      }
    }
  }
}

PUT foo/bar/1
{
  "field" : "this is a ehrliebende test"
}

P.S. Have you tried the ASCII folding filter, so that you do not have to maintain that crazy list entirely yourself? https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-asciifolding-tokenfilter.html
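
A quick way to compare the two approaches is to run a sample word through `asciifolding` with the analyze API. This is only a sketch, and the sample text is made up for illustration. Note one difference from the mapping above: `asciifolding` folds German umlauts to the bare vowel (ü becomes u), whereas the `komischeZeichen` mapping expands them (ü becomes ue), so the two are not drop-in equivalents for synonym entries such as "ueberich".

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Überich"
}
```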

mumpi · October 23, 2017, 11:13am

Dear Alexander

Thank you for your support. I found out that the problem occurs when a stopword filter contains words that are also present in the synonym list, and the synonym filter is one that replaces the synonyms with the first entry of the rule ("expand": false).

Minimal setup:

PUT /foo
{
  "mappings": {
    "bar": {
      "properties": {
        "field": {
          "type": "text",
          "analyzer": "root"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "root": {
          "type": "custom",
          "filter": [
            "stopDe",
            "root"
          ],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "stopDe": {
          "type": "stop",
          "stopwords": ["die"]
        },
        "root": {
          "expand": false,
          "type": "synonym",
          "synonyms": [
            "der,die,das"
          ]
        }
      }
    }
  }
}

Results in:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: die was completely eliminated by analyzer"
      }
    }
  },
  "status": 400
}

Dropping the stop-filter in the analyzer makes the index-creation run without errors.
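
The innermost error message ("term: die was completely eliminated by analyzer") suggests that, in this version, each term of a synonym rule is run through the filters that come before the synonym filter in the chain. Under that assumption, the effect can be reproduced without creating an index. The analyze request below is a sketch that applies the same tokenizer and an inline copy of the `stopDe` filter to the word "die"; the response should contain no tokens.

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": ["die"] }
  ],
  "text": "die"
}
```

Running a suspicious synonym term through the preceding filters in this way may be quicker than trial and error on a large synonym list.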

Best regards, Jürg

mumpi · October 30, 2017, 1:49pm

Last insight: if I reverse the order of the filters, i.e. synonym before stopwords, then there is no error.
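
Based on the observation above, the reversed chain would look like the following sketch. It reuses the names from the minimal setup, omits the mappings for brevity, and only swaps the two entries in the analyzer's filter list. One caveat, assuming that "expand": false maps every term of a rule to its first entry: "die" is then rewritten to "der" before the stop filter sees it, so the stopword "die" would no longer remove anything.

```console
PUT /foo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "root": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "root",
            "stopDe"
          ]
        }
      },
      "filter": {
        "stopDe": {
          "type": "stop",
          "stopwords": ["die"]
        },
        "root": {
          "type": "synonym",
          "expand": false,
          "synonyms": [
            "der,die,das"
          ]
        }
      }
    }
  }
}
```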

