Elastic tokenizer customization question

How can I use a token filter that behaves like the Word Delimiter Graph token filter, but without the following rules:
• Split tokens at letter case transitions. For example: PowerShot → Power, Shot
• Split tokens at letter-number transitions. For example: XL500 → XL, 500
• Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil

So from the example: "Neil's-Super-Duper-XL500--42+AutoCoder"
instead of these tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
the analyzer needs to produce these tokens:
[Neil, s, Super, Duper, XL500, 42, AutoCoder]

Thanks, Attila

I've found the solution on the Elasticsearch site.
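
The post above does not quote the solution it found. For readers with the same question, here is a minimal sketch of one configuration that should give the requested behaviour, assuming the documented `word_delimiter_graph` options `split_on_case_change`, `split_on_numerics` and `stem_english_possessive`. The filter name `wdg_no_case_split` and the analyzer name `wdg_analyzer` are made up for illustration, and the `whitespace` tokenizer is an assumption (it keeps the whole string together so that the filter does the splitting).

```python
import json

# Index settings body, built as a plain dict so it can be sent with any client.
# "wdg_no_case_split" and "wdg_analyzer" are hypothetical names.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "wdg_no_case_split": {
                    "type": "word_delimiter_graph",
                    # Keep "PowerShot" / "AutoCoder" as one token.
                    "split_on_case_change": False,
                    # Keep "XL500" as one token.
                    "split_on_numerics": False,
                    # Do not strip 's, so "Neil's" is split at the apostrophe.
                    "stem_english_possessive": False,
                }
            },
            "analyzer": {
                "wdg_analyzer": {
                    "type": "custom",
                    # Assumption: whitespace tokenizer, so delimiters such as
                    # "-" and "+" reach the word_delimiter_graph filter intact.
                    "tokenizer": "whitespace",
                    "filter": ["wdg_no_case_split"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The printed JSON can be used as the body of an index creation request. Running the example string through the `_analyze` API with this analyzer is the way to confirm the resulting tokens.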

Thanks for sharing your solution :slight_smile:

Thanks @warkolm,
my problem is searching a text field that contains concatenated text and numbers in the format "{text}{number}", for example DOC0000000009. I don't know how to search it with queries like these:

  • DOC0000000009 : For this I tried SpanTermQuery and MatchQuery, without success
  • DOC000000004? : For this I tried WildcardQuery, without success

For this text field I set a WordDelimiterGraphTokenFilter and a lowercase filter on both the analyzer and the search analyzer properties.
The queries only work with lowercase letters, like doc0000000009 or doc000000004, even when I use MatchQuery with the same analyzer.

I am only able to execute these queries with QueryStringQuery, but then I cannot build a ProximityQuery that contains a QueryStringQuery. So I can only use proximity queries with lowercased terms.

Could you please help with that?

Thanks, Attila
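
One possible workaround, sketched here under assumptions: term-level queries such as span_term and wildcard are typically not run through the field's analyzer, so the query term has to be lowercased by the caller to match tokens that the lowercase filter produced at index time. The sketch below builds the query DSL as plain dicts. The field name `body`, the helper names, and the slop value are hypothetical, and it assumes the identifier is indexed as a single token (for example with `split_on_numerics` disabled).

```python
import json


def lowercase_term(term: str) -> str:
    """Mimic the index-time lowercase filter for unanalyzed, term-level queries."""
    return term.lower()


def span_term(field: str, term: str) -> dict:
    """Exact-term span clause; the term is lowercased by hand."""
    return {"span_term": {field: {"value": lowercase_term(term)}}}


def span_wildcard(field: str, pattern: str) -> dict:
    """Wildcard wrapped in span_multi so it can be used inside span_near."""
    wildcard = {"wildcard": {field: {"value": lowercase_term(pattern)}}}
    return {"span_multi": {"match": wildcard}}


def span_near(clauses: list, slop: int, in_order: bool = True) -> dict:
    """Proximity query over span clauses."""
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


# Hypothetical field name "body" and slop of 5, for illustration only.
query = {
    "query": span_near(
        [
            span_term("body", "DOC0000000009"),
            span_wildcard("body", "DOC000000004?"),
        ],
        slop=5,
    )
}

print(json.dumps(query, indent=2))
```

This keeps the proximity query made of span clauses only, so no QueryStringQuery is needed inside it. Whether it matches still depends on how the field was actually tokenized, which the `_analyze` API can show.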

