Elastic tokenizer customization question

How can I use something similar to the word_delimiter_graph token filter, but without the following rules:
• Split tokens at letter case transitions. For example: PowerShot → Power, Shot
• Split tokens at letter-number transitions. For example: XL500 → XL, 500
• Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil

So from the example: "Neil's-Super-Duper-XL500--42+AutoCoder"
instead of these tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
the analyzer needs to produce these tokens:
[ Neil, s, Super, Duper, XL500, 42, AutoCoder ]

Thanks, Attila

I've found the solution on the Elasticsearch site.
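For anyone else who hits this: the word_delimiter_graph token filter has flags for exactly these three rules. A minimal sketch of the settings I ended up with looks something like this (the index, analyzer, and filter names are my own placeholders):

```
# The keyword tokenizer keeps the input intact; the filter does all the splitting
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_wdg_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_wdg_filter" ]
        }
      },
      "filter": {
        "my_wdg_filter": {
          "type": "word_delimiter_graph",
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": false
        }
      }
    }
  }
}

# Quick check of the resulting tokens
POST my-index/_analyze
{
  "analyzer": "my_wdg_analyzer",
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
```

The _analyze call should return [ Neil, s, Super, Duper, XL500, 42, AutoCoder ], which is the output I was after.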

Thanks for sharing your solution 🙂

Thanks @warkolm,
my problem is searching a text field that contains text and numbers concatenated in the format "{text}{number}", like DOC0000000009. I don't know how to search it with queries like these (rough sketches below):

  • DOC0000000009 : for this I tried SpanTermQuery and MatchQuery, without success
  • DOC000000004? : for this I tried WildcardQuery, without success
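Roughly what I ran, written out as REST bodies (index and field names are placeholders):

```
# Exact value: tried as a match query (and similarly as a span_term)
POST my-index/_search
{
  "query": {
    "match": { "my_field": "DOC0000000009" }
  }
}

# Trailing single-character wildcard
POST my-index/_search
{
  "query": {
    "wildcard": {
      "my_field": { "value": "DOC000000004?" }
    }
  }
}
```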

At index time I set a WordDelimiterGraphTokenFilter and a lowercase filter on both the field's analyzer and its search analyzer.
The queries only work with lowercase letters, like doc0000000009 or doc000000004, even if I use a MatchQuery with the same analyzer set.
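To illustrate, this is how I checked which tokens actually end up in the index (field name is a placeholder again):

```
# Runs the field's index-time analyzer against a sample value
GET my-index/_analyze
{
  "field": "my_field",
  "text": "DOC0000000009"
}
```

It returns lowercased terms, so my guess is that term-level queries like wildcard and span_term, which are not analyzed, can never match them when I write the pattern in uppercase.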

I am only able to execute these queries with a QueryStringQuery, but then I cannot wrap it in a ProximityQuery, because a ProximityQuery cannot contain a QueryStringQuery. So I can only run proximity queries with lowercased terms.
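To make that concrete, the only proximity form that works for me is along these lines, with the terms lowercased by hand (field name is a placeholder, and I'm assuming the delimiter filter splits DOC0000000009 into doc and 0000000009):

```
# Proximity via span_near; span_term is not analyzed, so terms must already be lowercase
POST my-index/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "my_field": "doc" } },
        { "span_term": { "my_field": "0000000009" } }
      ],
      "slop": 2,
      "in_order": true
    }
  }
}
```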

Could you please help with this?

Thanks, Attila
