Indexing a very long word

Hi folks,

I have a problem with indexing a text field. My mapping for this field looks like this:

"metadata": {
    "type": "text"
},

and some of the documents contain a very long string without whitespace in this field. I store data like this:

{'some': 'value', 'some': 'other_value_that_is_a_very_long_string_without_whitespaces'}

Usually it's JSON, but not always, so I need to keep the field as text.

The problem is that a couple of documents contain a string without whitespace that is longer than 32766 bytes...

When I try to reindex those documents I get an error:

    { _index: 'my_index',
      _type: 'my_doc',
      _id: 'some_id',
      status: 400,
      error: 
     { type: 'illegal_argument_exception',
     reason: 'Document contains at least one immense term in field="metadata" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: \'[50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48, 49, 51, 50, 48]...\', original message: bytes can be at most 32766 in length; got 35960',
     caused_by: 
      { type: 'max_bytes_length_exceeded_exception',
        reason: 'bytes can be at most 32766 in length; got 35960' } } }

This field has to be indexed because I need to be able to search documents by it.
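
For context, a typical search against this field looks roughly like this (just an illustrative sketch; the real queries vary):

    GET my_index/_search
    {
      "query": {
        "match": {
          "metadata": "other_value_that_is_a_very_long_string_without_whitespaces"
        }
      }
    }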

I tried to use a custom analyzer with a whitespace tokenizer and a char_filter, but it doesn't seem to work:

    "metadata_search_analyzer": {
      "tokenizer": "whitespace",
      "char_filter": [
        "metadata_char_filter"
      ]
    },
    "char_filter": {
      "metadata_char_filter": {
        "type": "pattern_replace",
        "pattern": "(.{32000})",
        "replacement": "$1 "
        }
      },
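
For completeness, this is roughly how I'm applying the analyzer in the index settings and the field mapping (a simplified sketch; the index name is a placeholder, other settings are omitted, and depending on the version the mapping may need the my_doc type wrapper):

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "metadata_char_filter": {
              "type": "pattern_replace",
              "pattern": "(.{32000})",
              "replacement": "$1 "
            }
          },
          "analyzer": {
            "metadata_search_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "char_filter": [
                "metadata_char_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "metadata": {
            "type": "text",
            "analyzer": "metadata_search_analyzer"
          }
        }
      }
    }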

Is there a way to solve or work around this issue?

Cheers,
Jogi
