Elastic tokenizer customization question

How can I use something similar to the word_delimiter_graph token filter, but without the following rules:
• Split tokens at letter case transitions. For example: PowerShot → Power, Shot
• Split tokens at letter-number transitions. For example: XL500 → XL, 500
• Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil

So from the example: "Neil's-Super-Duper-XL500--42+AutoCoder"
instead of these tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
the analyzer needs to produce these tokens:
[ Neil, s, Super, Duper, XL500, 42, AutoCoder ]

Thanks, Attila

I've found the solution on the Elasticsearch site.
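For anyone else who hits this: the word_delimiter_graph token filter has flags for exactly these three rules. A minimal sketch of the settings I ended up with looks something like this (the index, analyzer, and filter names are my own placeholders):

```
# The keyword tokenizer keeps the input intact; the filter does all the splitting
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_wdg_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_wdg_filter" ]
        }
      },
      "filter": {
        "my_wdg_filter": {
          "type": "word_delimiter_graph",
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": false
        }
      }
    }
  }
}

# Quick check of the resulting tokens
POST my-index/_analyze
{
  "analyzer": "my_wdg_analyzer",
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```

The _analyze call should return [ Neil, s, Super, Duper, XL500, 42, AutoCoder ], which is the output I was after.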

Thanks for sharing your solution 🙂

Thanks @warkolm,
my problem is searching a text field that contains text and numbers concatenated in the format "{text}{number}", like DOC0000000009. I don't know how to search it with queries like these (rough sketches below):

  • DOC0000000009 : for this I tried SpanTermQuery and MatchQuery, without success
  • DOC000000004? : for this I tried WildcardQuery, without success
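Roughly what I ran, written out as REST bodies (index and field names are placeholders):

```
# Exact value: tried as a match query (and similarly as a span_term)
POST my-index/_search
{
  "query": {
    "match": { "my_field": "DOC0000000009" }
  }
}

# Trailing single-character wildcard
POST my-index/_search
{
  "query": {
    "wildcard": {
      "my_field": { "value": "DOC000000004?" }
    }
  }
}
```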

At index time I set a WordDelimiterGraphTokenFilter and a lowercase filter on both the field's analyzer and its search analyzer.
The queries only work with lowercase letters, like doc0000000009 or doc000000004, even if I use a MatchQuery with the same analyzer set.
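To illustrate, this is how I checked which tokens actually end up in the index (field name is a placeholder again):

```
# Runs the field's index-time analyzer against a sample value
GET my-index/_analyze
{
  "field": "my_field",
  "text": "DOC0000000009"
}
```

It returns lowercased terms, so my guess is that term-level queries like wildcard and span_term, which are not analyzed, can never match them when I write the pattern in uppercase.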

I am only able to execute these queries with a QueryStringQuery, but then I cannot wrap it in a ProximityQuery, because a ProximityQuery cannot contain a QueryStringQuery. So I can only run proximity queries with lowercased terms.
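To make that concrete, the only proximity form that works for me is along these lines, with the terms lowercased by hand (field name is a placeholder, and I'm assuming the delimiter filter splits DOC0000000009 into doc and 0000000009):

```
# Proximity via span_near; span_term is not analyzed, so terms must already be lowercase
POST my-index/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "my_field": "doc" } },
        { "span_term": { "my_field": "0000000009" } }
      ],
      "slop": 2,
      "in_order": true
    }
  }
}
```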

Could you please help with this?

Thanks, Attila
