Index 256 out of bounds for length 256

Hi Team,
I am trying to run an API request like this in Elasticsearch:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data  is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data => test"]
    }
  ],
  
  "text": "heart malignant hemangiopericytoma is a test data"
}

The response is the error: "Index 256 out of bounds for length 256".
Maybe this means that a single mapping rule cannot exceed 256 characters, but my project needs longer rules.

Replacing really long strings is not necessarily what a char filter is designed for. What is the problem you are trying to solve using analysis with this char filter (at a higher level)? Can you please elaborate on the problem and use case and also provide a realistic example of how you want your data analyzed?

In my project, "benzylpenicillin allergy" is a keyword, and it is equivalent to "benzyl penicillin allergy".

Now, there is a document:

POST /index/_doc
{
    "text": "benzyl penicillin allergy should not be used in tissues with poor blood flow. If allergic symptoms occur (e.g. skin rash, itching, shortness of breath), tell a doctor immediately . Before treatment, a hypersensitivity test should be performed if possible."
}

I want to find this document when I search for "benzylpenicillin allergy".

With the standard analyzer, the query is analyzed into the tokens "benzylpenicillin" and "allergy", which is not feasible, because I still have other documents that contain only "allergy" or "penicillin".

So I removed the space between the words with the mappings char filter, and made "benzylpenicillin allergy" equal to "benzylpenicillinallergy" with a synonym (alias) filter.

Do you have a better way? Thanks.

More examples:

"isoniazide allergy" is equal to "inh allergy" and is equal to "isonicotinylhydrazide allergy".

"superficial mycosis" is equal to "steroid-modified tinea infection" and is equal to "piedra".

What about using synonyms?
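As a minimal sketch (index, field, and filter names here are just placeholders, not tested against your data), a synonym token filter applied at search time could look like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "drug_synonyms": {
          "type": "synonym",
          "synonyms": [
            "benzylpenicillin, benzyl penicillin",
            "isoniazide, inh, isonicotinylhydrazide"
          ]
        }
      },
      "analyzer": {
        "synonym_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "drug_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search"
      }
    }
  }
}

Applying synonyms only in the search_analyzer keeps the index small and lets you update the synonym list without reindexing.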


I tried synonyms, but they don't solve the problem.

They don't work when my synonyms contain spaces.

Like this:
analysis/synonym.txt

thiopental allergy,penthiobarbital allergy,pentothiobarbital allergy
d-mannitol allergy,mannitol allergy
cefotaxime allergy
cephalosporin allergy
amodiaquine allergy,camoquin allergy,flavoquine allergy

The result I want is:

GET test_index/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "penthiobarbital allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "pentothiobarbital allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 0
    },
    {
      "token": "a",
      "start_offset": 22,
      "end_offset": 23,
      "type": "word",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 24,
      "end_offset": 28,
      "type": "word",
      "position": 0
    }
  ]
}

But here is what actually happens:

1. When I use a comma tokenizer and synonyms

PUT test_index
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "comma",
                        "filter": ["my_synonym","lowercase"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt"
                    }
                },
                "tokenizer":{
					"comma":{
						"type": "pattern",
						"pattern":",|,"
					}
			    }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}


GET test_index/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental allergy is a test",
      "start_offset": 0,
      "end_offset": 28,
      "type": "word",
      "position": 0
    }
  ]
}

2. When I use the standard tokenizer and synonyms

PUT test_index2
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["my_synonym","lowercase"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonymOld.txt"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}


GET test_index2/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "penthiobarbital",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "pentothiobarbital",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 19,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 22,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "test",
      "start_offset": 24,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

Neither of these results meets my needs.
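One option worth trying (a sketch only; the index name and analyzer names are made up, and I have not run this against your data): use the synonym_graph token filter instead of synonym in the search analyzer. It is designed for multi-word synonyms: it emits a token graph, and match and match_phrase queries built from that graph treat an expansion like "penthiobarbital allergy" as a phrase rather than as independent terms:

PUT test_index3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_graph": {
          "type": "synonym_graph",
          "expand": true,
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym_graph"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "abstract": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}

Note that synonym_graph is only safe at search time; it should not be used in the index analyzer, because graph tokens cannot be indexed correctly.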

Maybe adding a shingle filter before the synonym token filter would help?

That results in this error: "Token filter [shingle] cannot be used to parse synonyms"

PUT test_index
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "whitespace",
                        "filter": ["shingle","my_synonym"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}
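As the error says, a shingle filter cannot appear before a synonym filter, because the synonym file itself is parsed with the same filter chain. As a quick experiment (an untested sketch using an inline synonym rule instead of the file), you can see how synonym_graph handles a multi-word synonym directly in _analyze:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": ["thiopental allergy, penthiobarbital allergy"]
    }
  ],
  "text": "thiopental allergy is a test"
}

The expansions still come out as separate tokens rather than a single "thiopental allergy" token, but they carry graph information (position length), which query builders use to turn the multi-word synonyms into phrase queries at search time.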
