WordPiece tokenizer

I am trying to use WordPiece tokenization (the tokenization method used in BERT) directly as an Elasticsearch tokenizer.

After some digging I found this file in the official Elasticsearch source code, but I can't find any way to use it in Elasticsearch, nor any documentation for it.

Any clue, or is there a plugin I need to use?

P.S. My Elasticsearch experience is pretty limited.
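For context, here is a minimal sketch of what WordPiece tokenization produces, using the Hugging Face transformers library outside of Elasticsearch; the model name "bert-base-uncased" is just an example:

```python
# Minimal sketch: what WordPiece tokenization looks like, run client-side
# with Hugging Face transformers (not an Elasticsearch feature).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# WordPiece splits rarer words into subword pieces prefixed with "##",
# e.g. "tokenization" -> ['token', '##ization'].
print(tokenizer.tokenize("Tokenization in Elasticsearch"))
```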


I'm also interested in using this type of tokenizer. Thank you for asking.

You cannot do what you want today, but we are working on making this possible in the future.

https://github.com/elastic/elasticsearch/pull/82870 will migrate the custom tokenization code that currently sits outside the Lucene tokenization framework into that framework.

Then we will need a subsequent change to make that functionality reusable in other places in Elasticsearch where you can access tokenizers.

So eventually what you want to do should be possible, but not at the moment unfortunately.

Hello, with the update of Elasticsearch to 8.0, I wonder whether it would be possible to create a tokenization model with Eland and then upload the output to a custom field that would play the role of the tokenizer output.
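To illustrate the idea (this is only a sketch of the workaround I have in mind, not an official feature): tokenize the text client-side with a trained WordPiece tokenizer and index the resulting pieces into a separate, whitespace-analyzed field. The index name, field names, and host below are hypothetical.

```python
# Sketch of a client-side pre-tokenization workaround (assumptions: a local
# cluster, a "my-index" index with a whitespace-analyzed "wordpieces" field).
from elasticsearch import Elasticsearch
from transformers import BertTokenizerFast

es = Elasticsearch("http://localhost:9200")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "Some document text to index"
pieces = tokenizer.tokenize(text)

es.index(
    index="my-index",
    document={
        "text": text,                        # original text
        "wordpieces": " ".join(pieces),      # pre-tokenized form
    },
)
```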

Since last time, my needs have shifted a little: I have trained my own tokenizers, one for each supported language, and I would need to select the tokenizer depending on language detection.

Any idea if this is doable today, and/or how it could be done?
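For the per-language part, here is a minimal client-side sketch of what I mean; the language detector, the tokenizer file names, and the fallback choice are all assumptions, with the tokenizers loaded from files produced by my own training:

```python
# Sketch: pick a trained tokenizer based on detected language (client-side).
from langdetect import detect
from tokenizers import Tokenizer

# Hypothetical tokenizer files, one per supported language.
tokenizers = {
    "en": Tokenizer.from_file("tokenizer-en.json"),
    "fr": Tokenizer.from_file("tokenizer-fr.json"),
}

def tokenize(text: str):
    lang = detect(text)                            # e.g. "en" or "fr"
    tok = tokenizers.get(lang, tokenizers["en"])   # fall back to English
    return tok.encode(text).tokens
```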
