Document(s) failed to index | mapper_parsing_exception

Hi,

I have an ES instance with the following mapping & config:

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "custom_asciifolding",
            "lowercase"
          ]
        }
      },
      "filter": {
        "custom_asciifolding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [
            "custom_asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "artist_id": {
          "type": "integer"
        },
        "artist_genre": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        },
        "artist_name": {
          "type": "text",
          "analyzer": "custom_analyzer",
          "fields": {
            "raw": {
              "normalizer": "custom_normalizer",
              "type": "keyword"
            }
          }
        },
        "artist_type": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        },
        "associated_alias": {
          "type": "nested",
          "properties": {
            "alias_type": {
              "type": "keyword",
              "normalizer": "custom_normalizer"
            },
            "artist_id": {
              "type": "integer"
            },
            "artist_name": {
              "type": "text",
              "analyzer": "custom_analyzer"
            }
          }
        },
        "associated_artists": {
          "type": "nested",
          "properties": {
            "_id": {
              "type": "keyword"
            },
            "artist_id": {
              "type": "integer"
            },
            "artist_name": {
              "type": "text",
              "analyzer": "custom_analyzer"
            },
            "sequence_number": {
              "type": "integer"
            }
          }
        },
        "is_active": {
          "type": "boolean"
        },
        "record_provider_name": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        },
        "record_providers": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "keyword",
              "normalizer": "custom_normalizer"
            },
            "count": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}

While updating an existing document, I get the following error:

elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', "failed to parse field [artist_name.raw] of type [keyword] in document with id 'uQRyF3sBPez8u38O8yMm'. Preview of field's value: 'Relajación'")

While creating a new document, I get a similar error:

      "error": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [artist_name.raw] of type [keyword] in document with id '8a9d479a-b665-4b49-97de-e4efcb7be446'. Preview of field's value: 'Andrea Miller, Alejandro Fernández Lecce'",
        "caused_by": {
          "type": "illegal_state_exception",
          "reason": "The normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer analyzer name[custom_normalizer], analyzer [org.elasticsearch.index.analysis.CustomAnalyzer@7c67f588], analysisMode [ALL] and input \"Andrea Miller, Alejandro Fernández Lecce\""
        }
      },

Please note that in both cases, there are some non-ASCII characters like Fernández and Relajación.

Hi @mohitthakkar_ihm

Look at the message:

the normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer name[custom_normalizer].

Your custom_normalizer is generating two tokens because of the preserve_original setting in custom_asciifolding:

  "custom_asciifolding": {
    "type": "asciifolding",
    "preserve_original": true
  }

Test the analyzer that uses custom_asciifolding:

GET idx_name/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["Fernández"]
}

{
  "tokens": [
    {
      "token": "Fernandez",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Fernández",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
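You can also run the same check against the normalizer itself, since that is what the keyword fields use (idx_name is a placeholder for your index name):

GET idx_name/_analyze
{
  "normalizer": "custom_normalizer",
  "text": ["Fernández"]
}

If this returns more than one token, it reproduces the illegal_state_exception you see at index time, because a normalizer must emit exactly one token.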

If you set preserve_original to false, the error will not happen. I don't know your reason for using preserve_original = true, but it is the cause of your problem; you may need a different strategy for the normalizer.
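One possible strategy (just a sketch; asciifolding_single is a hypothetical filter name) is to keep preserve_original: true for the full-text analyzer, and define a second asciifolding filter without it for the normalizer, so keyword fields always fold to a single ASCII token:

"filter": {
  "custom_asciifolding": {
    "type": "asciifolding",
    "preserve_original": true
  },
  "asciifolding_single": {
    "type": "asciifolding",
    "preserve_original": false
  }
},
"normalizer": {
  "custom_normalizer": {
    "type": "custom",
    "char_filter": [],
    "filter": [
      "asciifolding_single"
    ]
  }
}

This way artist_name still indexes both "Fernandez" and "Fernández" for full-text search, while artist_name.raw and the other keyword fields store a single folded value.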

Thanks @RabBit_BR

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.