Building a custom tokenizer: "Could not find suitable constructor"

ndtreviv · September 20, 2017, 9:11am

I'm building a custom tokenizer in response to this: Performance of doc_values field vs analysed field

None of this API appears to be documented (?), so I'm going off code samples from other plugins/tokenizers. But when I restart Elasticsearch after deploying my tokenizer, I get this error constantly in the logs:

[2017-09-20 08:45:37,412][WARN ][indices.cluster          ] [Samuel Silke] [[storm-crawler-2017-09-11][3]] marking and sending shard failed due to [failed to create index]
[storm-crawler-2017-09-11] IndexCreationException[failed to create index]; nested: CreationException[Guice creation errors:

1) Could not find a suitable constructor in com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
  at com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory.class(Unknown Source)
  at org.elasticsearch.index.analysis.TokenizerFactoryFactory.create(Unknown Source)
  at org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown Source)
  at _unknown_

1 error];
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:360)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:294)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:163)
	at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
	at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

1) Could not find a suitable constructor in com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
  at com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory.class(Unknown Source)
  at org.elasticsearch.index.analysis.TokenizerFactoryFactory.create(Unknown Source)
  at org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown Source)
  at _unknown_

1 error
	at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:360)
	at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:172)
	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
	at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:157)
	at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:358)
	... 9 more

My tokenizer is built for v2.3.4, and the TokenizerFactory looks like this:

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public UrlTokenizerFactory(Index index, IndexSettingsService indexSettings, @Assisted String name, @Assisted Settings settings){
        super(index, indexSettings.getSettings(), name, settings);
    }

    @Override
    public Tokenizer create() {
        return new UrlTokenizer();
    }
}

I genuinely don't know what I'm doing wrong. Have I deployed it incorrectly? It appears to be using my classes according to the logs...

I've only deployed it to one of my es nodes (4-node cluster). The /_cat/plugins?v endpoint gives this:

name         component          version type url 
Samuel Silke urltokenizer       2.3.4.0 j

Can anyone help? What am I doing wrong?

rjernst · September 20, 2017, 3:51pm

What are the entire contents of your tokenizer .java file (in particular, the imports)? Note that Tokenizers were changed in 5.0 to be "deguiced", so they should no longer have these types of pains when developing.

ndtreviv · September 20, 2017, 5:59pm

Here are the entire contents of my TokenizerFactory class (sans package):

import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
import org.elasticsearch.index.settings.IndexSettingsService;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public UrlTokenizerFactory(Index index, IndexSettingsService indexSettings, @Assisted String name, @Assisted Settings settings){
        super(index, indexSettings.getSettings(), name, settings);
    }

    @Override
    public Tokenizer create() {
        return new UrlTokenizer();
    }
}

Do you need my tokenizer as well?

Remember, I'm developing for v2.3.4, not 5.0.

Thanks for any help!

rjernst · September 20, 2017, 10:08pm

How are you starting elasticsearch?

ndtreviv · September 20, 2017, 10:23pm

sudo service elasticsearch restart

rjernst · September 21, 2017, 8:49pm

Sorry, I don't have any other ideas. Guice is a nightmare, which is why we have been slowly removing it. You should really upgrade to 5.x.

ndtreviv · September 22, 2017, 1:25pm

Oh.

Any chance you could give me some guidance on how I might re-create the guice stuff to see why that's failing?

Ivan · September 22, 2017, 3:45pm

A shot in the dark, but are you sure that the IndexSettingsService is used in the constructor? Try using the index settings directly (which should be injected):

public UrlTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings)

rjernst · September 22, 2017, 7:14pm

The guice stuff is all initialized in Node.java. At least, in there you can see which Module classes are loaded, which set up bindings in guice. It seems to find your class OK, but then does not find the constructor. Is the snippet you gave the entirety of your tokenizer factory class? I had first thought maybe you were using the Inject annotation from javax or something like that, which is why I asked to see the imports. But the classes in your ctor parameters seem to match up with those in other tokenizer factories.
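
One way to follow up on the import question without stepping through Guice itself is to list, by reflection, the constructors of the factory class and the fully qualified annotation types on each. The sketch below only approximates the rule quoted in the error message; it is not Guice's actual resolution code. The nested Inject annotation and the Annotated/Unannotated classes are hypothetical stand-ins so that it compiles without Elasticsearch on the classpath. Against the real plugin you would presumably pass UrlTokenizerFactory.class and org.elasticsearch.common.inject.Inject.class instead.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class ConstructorCheck {

    // Hypothetical stand-in for org.elasticsearch.common.inject.Inject.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.CONSTRUCTOR)
    @interface Inject {}

    // Hypothetical factory shapes used only for the demonstration.
    static class Annotated {
        @Inject
        public Annotated(String name) {}
    }

    static class Unannotated {
        public Unannotated(String name) {}
    }

    // One line per declared constructor, followed by the binary names of its
    // annotation types, so a wrong Inject import shows up by package name.
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Constructor<?> c : type.getDeclaredConstructors()) {
            StringBuilder line = new StringBuilder(c.toString());
            for (Annotation a : c.getDeclaredAnnotations()) {
                line.append(" @").append(a.annotationType().getName());
            }
            lines.add(line.toString());
        }
        return lines;
    }

    // Approximates the rule in the error message: exactly one constructor
    // carrying the given annotation, or a non-private zero-argument constructor.
    static boolean hasSuitableConstructor(Class<?> type, Class<? extends Annotation> inject) {
        int injectCount = 0;
        boolean usableNoArg = false;
        for (Constructor<?> c : type.getDeclaredConstructors()) {
            if (c.isAnnotationPresent(inject)) {
                injectCount++;
            }
            if (c.getParameterCount() == 0 && !Modifier.isPrivate(c.getModifiers())) {
                usableNoArg = true;
            }
        }
        return injectCount == 1 || (injectCount == 0 && usableNoArg);
    }

    public static void main(String[] args) {
        for (Class<?> type : new Class<?>[] {Annotated.class, Unannotated.class}) {
            System.out.println(type.getName() + " suitable: " + hasSuitableConstructor(type, Inject.class));
            describe(type).forEach(System.out::println);
        }
    }
}
```

If the annotation printed next to the constructor comes from a package other than org.elasticsearch.common.inject, the constructor would not count as injectable under this rule, which is the scenario the import question was probing.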

ndtreviv · September 22, 2017, 7:31pm

Thanks for your help.

I tried your "shot in the dark". It compiled, and I got a slightly different error this time:

Caused by: java.lang.IllegalStateException: [index.version.created] is not present in the index settings for index with uuid: [null]
	at org.elasticsearch.Version.indexCreated(Version.java:584)
	at org.elasticsearch.index.analysis.Analysis.parseAnalysisVersion(Analysis.java:99)
	at org.elasticsearch.index.analysis.AbstractTokenizerFactory.<init>(AbstractTokenizerFactory.java:40)

I'll keep digging

ndtreviv · September 22, 2017, 7:40pm

Although this annotation doesn't resolve.

ndtreviv · September 22, 2017, 7:42pm

Just found one here: https://github.com/codelibs/elasticsearch-analysis-kuromoji-neologd/blob/2.3.x/src/main/java/org/codelibs/elasticsearch/kuromoji/neologd/index/analysis/KuromojiTokenizerFactory.java that has Environment in the constructor. Trying that!

ndtreviv · September 22, 2017, 7:44pm

Well, it seems to have worked...now to see if it, you know, actually works...

ndtreviv · September 22, 2017, 7:55pm

Yup. It's calling my tokenizer. But now it's revealed that my tokenizer is in fact crap!

Caused by: java.lang.IndexOutOfBoundsException
	at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.append(CharTermAttributeImpl.java:131)
	at com.cameraforensics.elasticsearch.plugins.UrlTokenizer.incrementToken(UrlTokenizer.java:30)

Probably because - as there are no docs - I'm doing it wrong.

    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tokens.size()) {
            return false;
        } else {
            termAtt.setEmpty().append(tokens.get(position), position, position);
            position++;
            return true;
        }
    }

tokens is a list of all permutations of index segmentation (as per this: Performance of doc_values field vs analysed field)

I'm not really sure what the two int values should be on CharTermAttribute#append, so I'm guessing - incorrectly.

Anyway, thanks for all of your help. I'll keep hacking!
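
The thread does not show the final fix, so the following is only a sketch of what the two int arguments mean. CharTermAttribute extends java.lang.Appendable, whose append(CharSequence csq, int start, int end) appends the subsequence of csq from start (inclusive) to end (exclusive). The indices refer to characters within the token, not to the token's position in the list. The example uses StringBuilder, another Appendable, as a stand-in for termAtt so that it runs without Lucene; the emitAll helper and the sample tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AppendRangeDemo {

    // Walks the token list the way incrementToken() does, one token per step.
    // wholeToken = true appends characters [0, length) of each token;
    // wholeToken = false repeats the original call shape (position, position).
    static List<String> emitAll(List<String> tokens, boolean wholeToken) {
        List<String> emitted = new ArrayList<>();
        for (int position = 0; position < tokens.size(); position++) {
            String token = tokens.get(position);
            StringBuilder term = new StringBuilder(); // plays the role of termAtt.setEmpty()
            if (wholeToken) {
                term.append(token, 0, token.length());
            } else {
                // Empty range at best; out of bounds once position exceeds the token length.
                term.append(token, position, position);
            }
            emitted.add(term.toString());
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a.b.c", "b.c", "c");
        System.out.println(emitAll(tokens, true));
        try {
            emitAll(tokens, false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("range form failed: " + e);
        }
    }
}
```

With Lucene on the classpath, the equivalent call would presumably be termAtt.setEmpty().append(tokens.get(position)), which appends the whole token and needs no indices.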

ndtreviv · September 22, 2017, 7:59pm

Nailed it. Thanks again!

ndtreviv · September 25, 2017, 2:13pm

PS: If you want to continue with me on my Journey of Pain: Custom tokenizer doesn't work on reindex/index api, only _analyze endpoint

system · October 23, 2017, 2:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
