Max length allowed for "max_token_length" and how to set value


I would like to set the standard tokenizer to use a token length of 30000 for a field (I am inserting biological sequences into this field) instead of cutting up the string into 255 length tokens as is the default. Can this be done? What is the JSON command to do this?


The `standard` tokenizer does accept a `max_token_length` setting, but raising it far above the default of 255 has a massive impact on indexing and search. Also bear in mind that Lucene rejects any single term longer than 32766 bytes, so a 30000-character token sits just under that hard limit.
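For reference, the `max_token_length` setting is applied on a custom tokenizer in the index settings. A settings body along these lines should work (the `seq_tokenizer` and `seq_analyzer` names are just illustrative):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "seq_tokenizer": {
          "type": "standard",
          "max_token_length": 30000
        }
      },
      "analyzer": {
        "seq_analyzer": {
          "type": "custom",
          "tokenizer": "seq_tokenizer"
        }
      }
    }
  }
}
```

You would then set `"analyzer": "seq_analyzer"` on the sequence field's mapping.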

What are you trying to do? Search for sequence alignments? If so, there are better solutions: locality-sensitive hashing, for example, though it would have to be ported to Elasticsearch 2.3+, perhaps as a plugin.
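To illustrate the locality-sensitive-hashing idea (this is a standalone sketch, not an Elasticsearch plugin): a MinHash signature over a sequence's k-mers lets you estimate similarity between sequences cheaply, so near-matches can be found without indexing the whole sequence as one term.

```python
import hashlib

def kmers(seq, k=8):
    """Set of overlapping length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(shingles, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash over all shingles. Similar sets share many minima."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles)
        for seed in range(num_hashes)
    ]

def similarity(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard
    similarity of the underlying k-mer sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A single substitution in a long sequence changes only the k-mers overlapping that position, so the signatures stay close, while unrelated sequences score near zero.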

Another approach to consider is to split the token yourself into smaller, meaningful chunks, and then use a phrase search.
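One way to do that split (a sketch, with an illustrative `to_tokens` helper): emit one overlapping k-mer per position and index those as the field text. Because a substring's k-mers appear consecutively in the document's k-mer stream, a phrase query built from the query fragment's own k-mers matches any occurrence of that fragment.

```python
def to_tokens(seq, k=10):
    """One overlapping k-mer per position of the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Index this space-joined text instead of the raw 30000-character string:
doc = "ACGTTGCAAGGCTTACGATCGGATCCTAGCATGCAAGTCC"
indexed_text = " ".join(to_tokens(doc))

# ...and run a match_phrase query over the fragment's k-mers:
fragment = doc[5:25]
phrase = " ".join(to_tokens(fragment))
```

The trade-off is index size: a sequence of length n produces roughly n tokens of length k.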

If you really do want only exact matches, you could index it as a not_analyzed string.
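In Elasticsearch 2.x that would be a mapping along these lines (index, type, and field names are illustrative). Note that a not_analyzed value is still a single Lucene term, so it must stay under the 32766-byte limit, which a 30000-character ASCII sequence does:

```json
{
  "mappings": {
    "sequence": {
      "properties": {
        "dna": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
```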