Index 256 out of bounds for length 256

Hi Team,
I am trying to run an API request like this in Elasticsearch:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data  is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data is a test data => test"]
    }
  ],
  
  "text": "heart malignant hemangiopericytoma is a test data"
}

The response is the error: "Index 256 out of bounds for length 256".
Maybe this means that a single mapping rule cannot exceed 256 characters, but my project needs longer rules.

Replacing really long strings is not necessarily what a char filter is designed for. What is the problem you are trying to solve using analysis with this char filter (at a higher level)? Can you please elaborate on the problem and use case and also provide a realistic example of how you want your data analyzed?

In my project, "benzylpenicillin allergy" is a keyword, and it is equivalent to "benzyl penicillin allergy".

Now, there is a document:

POST /index/_doc
{
    "text": "benzyl penicillin allergy should not be used in tissues with poor blood flow. If allergic symptoms occur (e.g. skin rash, itching, shortness of breath), tell a doctor immediately . Before treatment, a hypersensitivity test should be performed if possible."
}

I want to find this document when I search for "benzylpenicillin allergy".

With the standard analyzer, the query is analyzed into the tokens "benzylpenicillin" and "allergy", which is not feasible, because I still have other documents that contain only "allergy" or "penicillin".

So I removed the space between the words with the mappings char filter, and made "benzylpenicillin allergy" equal to "benzylpenicillinallergy" with a synonym (alias) filter.

Do you have a better way? Thanks.

More examples:

"isoniazide allergy" is equal to "inh allergy" and is equal to "isonicotinylhydrazide allergy".

"superficial mycosis" is equal to "steroid-modified tinea infection" and is equal to "piedra".

What about using synonyms?
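As a minimal sketch (index, field, and filter names here are just placeholders, not tested against your data), a synonym token filter applied at search time could look like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "drug_synonyms": {
          "type": "synonym",
          "synonyms": [
            "benzylpenicillin, benzyl penicillin",
            "isoniazide, inh, isonicotinylhydrazide"
          ]
        }
      },
      "analyzer": {
        "synonym_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "drug_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search"
      }
    }
  }
}

Applying synonyms only in the search_analyzer keeps the index small and lets you update the synonym list without reindexing.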


I tried synonyms, but they don't solve the problem.

They don't work when my synonyms contain spaces.

Like this:
analysis/synonym.txt

thiopental allergy,penthiobarbital allergy,pentothiobarbital allergy
d-mannitol allergy,mannitol allergy
cefotaxime allergy
cephalosporin allergy
amodiaquine allergy,camoquin allergy,flavoquine allergy

The result I want is:

GET test_index/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "penthiobarbital allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "pentothiobarbital allergy",
      "start_offset": 0,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 0
    },
    {
      "token": "a",
      "start_offset": 22,
      "end_offset": 23,
      "type": "word",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 24,
      "end_offset": 28,
      "type": "word",
      "position": 0
    }
  ]
}

But here is what actually happens:

1. When I use a comma tokenizer and synonyms

PUT test_index
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "comma",
                        "filter": ["my_synonym","lowercase"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt"
                    }
                },
                "tokenizer":{
					"comma":{
						"type": "pattern",
						"pattern":",|,"
					}
			    }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}


GET test_index/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental allergy is a test",
      "start_offset": 0,
      "end_offset": 28,
      "type": "word",
      "position": 0
    }
  ]
}

2. When I use the standard tokenizer and synonyms

PUT test_index2
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["my_synonym","lowercase"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonymOld.txt"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}


GET test_index2/_analyze
{
   "analyzer":"my_analyzer",
   "text": "thiopental allergy is a test"
}

result:

{
  "tokens": [
    {
      "token": "thiopental",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "penthiobarbital",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "pentothiobarbital",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "allergy",
      "start_offset": 11,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 19,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 22,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "test",
      "start_offset": 24,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

Neither of these results meets my needs.
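One option worth trying (a sketch only; the index name and analyzer names are made up, and I have not run this against your data): use the synonym_graph token filter instead of synonym in the search analyzer. It is designed for multi-word synonyms: it emits a token graph, and match and match_phrase queries built from that graph treat an expansion like "penthiobarbital allergy" as a phrase rather than as independent terms:

PUT test_index3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_graph": {
          "type": "synonym_graph",
          "expand": true,
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym_graph"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "abstract": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}

Note that synonym_graph is only safe at search time; it should not be used in the index analyzer, because graph tokens cannot be indexed correctly.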

Maybe adding a shingle filter before the synonym token filter would help?

That results in this error: "Token filter [shingle] cannot be used to parse synonyms"

PUT test_index
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "analyzer": {   
                    "index_analyzer": {
                        "tokenizer": "standard",
                        "type": "custom"
                    },
                    "my_analyzer": {
                        "tokenizer": "whitespace",
                        "filter": ["shingle","my_synonym"],
                        "type": "custom"
                    }
                },
                "filter": {
                    "my_synonym": {
                        "ignore_case": "true",
                        "expand": "true",
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt"
                    }
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "abstract": {
                "type": "text",
                "analyzer": "index_analyzer",
                "search_analyzer": "my_analyzer"
            }
        }
    }
}
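As the error says, a shingle filter cannot appear before a synonym filter, because the synonym file itself is parsed with the same filter chain. As a quick experiment (an untested sketch using an inline synonym rule instead of the file), you can see how synonym_graph handles a multi-word synonym directly in _analyze:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": ["thiopental allergy, penthiobarbital allergy"]
    }
  ],
  "text": "thiopental allergy is a test"
}

The expansions still come out as separate tokens rather than a single "thiopental allergy" token, but they carry graph information (position length), which query builders use to turn the multi-word synonyms into phrase queries at search time.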
