WordPiece tokenizer

I am trying to use WordPiece tokenization (the tokenization method used in BERT) directly as an Elasticsearch tokenizer.

After some digging I found this file in the official Elasticsearch source code, but I can't find any way to use it in Elasticsearch, nor any documentation for it.

Any clue, or is there a plugin I need to use?

P.S. My Elasticsearch experience is pretty limited.
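For context, here is a minimal sketch of what WordPiece tokenization produces, using the Hugging Face transformers library outside of Elasticsearch; the model name "bert-base-uncased" is just an example:

```python
# Minimal sketch: what WordPiece tokenization looks like, run client-side
# with Hugging Face transformers (not an Elasticsearch feature).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# WordPiece splits rarer words into subword pieces prefixed with "##",
# e.g. "tokenization" -> ['token', '##ization'].
print(tokenizer.tokenize("Tokenization in Elasticsearch"))
```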


I'm also interested in using this type of tokenizer. Thank you for asking.

You cannot do what you want today, but we are working on making this possible in the future.

https://github.com/elastic/elasticsearch/pull/82870 will migrate the custom tokenization code that currently sits outside the Lucene tokenization framework into that framework.

Then we will need a subsequent change to make that functionality reusable in other places in Elasticsearch where you can access tokenizers.

So eventually what you want to do should be possible, but not at the moment unfortunately.

Hello, with the update of Elasticsearch to 8.0, I wonder whether it would be possible to create a tokenization model with Eland and then upload the output to a custom field that would play the role of the tokenizer output.
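To illustrate the idea (this is only a sketch of the workaround I have in mind, not an official feature): tokenize the text client-side with a trained WordPiece tokenizer and index the resulting pieces into a separate, whitespace-analyzed field. The index name, field names, and host below are hypothetical.

```python
# Sketch of a client-side pre-tokenization workaround (assumptions: a local
# cluster, a "my-index" index with a whitespace-analyzed "wordpieces" field).
from elasticsearch import Elasticsearch
from transformers import BertTokenizerFast

es = Elasticsearch("http://localhost:9200")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "Some document text to index"
pieces = tokenizer.tokenize(text)

es.index(
    index="my-index",
    document={
        "text": text,                        # original text
        "wordpieces": " ".join(pieces),      # pre-tokenized form
    },
)
```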

Since last time, my needs have shifted a little: I have trained my own tokenizers, one for each supported language, and I would need to select the tokenizer depending on language detection.

Any idea if this is doable today, and/or how it could be done?
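For the per-language part, here is a minimal client-side sketch of what I mean; the language detector, the tokenizer file names, and the fallback choice are all assumptions, with the tokenizers loaded from files produced by my own training:

```python
# Sketch: pick a trained tokenizer based on detected language (client-side).
from langdetect import detect
from tokenizers import Tokenizer

# Hypothetical tokenizer files, one per supported language.
tokenizers = {
    "en": Tokenizer.from_file("tokenizer-en.json"),
    "fr": Tokenizer.from_file("tokenizer-fr.json"),
}

def tokenize(text: str):
    lang = detect(text)                            # e.g. "en" or "fr"
    tok = tokenizers.get(lang, tokenizers["en"])   # fall back to English
    return tok.encode(text).tokens
```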
