Indexing a very long word

Hi folks,

I have a problem with indexing a text field. My mapping for this field looks like this:

"metadata": {
    "type": "text"
},

and some of the documents contain a very long string without whitespace in this field. I store data like this:

{'some': 'value', 'some': 'other_value_that_is_a_very_long_string_without_whitespaces'}

Usually it's JSON, but not always, so I need to keep the field as text.

The problem is that a couple of documents contain a string without whitespace that is longer than 32766 bytes...

When I try to reindex those documents I get an error:

    { _index: 'my_index',
      _type: 'my_doc',
      _id: 'some_id',
      status: 400,
      error: 
     { type: 'illegal_argument_exception',
     reason: 'Document contains at least one immense term in field="metadata" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: \'[50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48]...\', original message: bytes can be at most 32766 in length; got 35960',
     caused_by: 
      { type: 'max_bytes_length_exceeded_exception',
        reason: 'bytes can be at most 32766 in length; got 35960' } } }

This field has to be indexed because I need to be able to search documents by it.
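
For context, a typical search against this field looks roughly like this (just an illustrative sketch; the real queries vary):

    GET my_index/_search
    {
      "query": {
        "match": {
          "metadata": "other_value_that_is_a_very_long_string_without_whitespaces"
        }
      }
    }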

I tried to use a custom analyzer with a whitespace tokenizer and a char_filter, but it doesn't seem to work:

    "metadata_search_analyzer": {
      "tokenizer": "whitespace",
      "char_filter": [
        "metadata_char_filter"
      ]
    },
    "char_filter": {
      "metadata_char_filter": {
        "type": "pattern_replace",
        "pattern": "(.{32000})",
        "replacement": "$1 "
        }
      },
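
For completeness, this is roughly how I'm applying the analyzer in the index settings and the field mapping (a simplified sketch; the index name is a placeholder, other settings are omitted, and depending on the version the mapping may need the my_doc type wrapper):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "metadata_char_filter": {
              "type": "pattern_replace",
              "pattern": "(.{32000})",
              "replacement": "$1 "
            }
          },
          "analyzer": {
            "metadata_search_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "char_filter": [
                "metadata_char_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "metadata": {
            "type": "text",
            "analyzer": "metadata_search_analyzer"
          }
        }
      }
    }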

Is there a way to solve or work around this issue?

Cheers,
Jogi
