Building a custom tokenizer: "Could not find suitable constructor"

I'm building a custom tokenizer in response to this thread: Performance of doc_values field vs analysed field

None of this API appears to be documented (?), so I'm going off code samples from other plugins/tokenizers. When I restart Elasticsearch after deploying my tokenizer, I get this error constantly in the logs:

[2017-09-20 08:45:37,412][WARN ][indices.cluster          ] [Samuel Silke] [[storm-crawler-2017-09-11][3]] marking and sending shard failed due to [failed to create index]
[storm-crawler-2017-09-11] IndexCreationException[failed to create index]; nested: CreationException[Guice creation errors:

1) Could not find a suitable constructor in com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
  at com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory.class(Unknown Source)
  at org.elasticsearch.index.analysis.TokenizerFactoryFactory.create(Unknown Source)
  at org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown Source)
  at _unknown_

1 error];
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:360)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewIndices(IndicesClusterStateService.java:294)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:163)
	at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
	at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

1) Could not find a suitable constructor in com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
  at com.cameraforensics.elasticsearch.plugins.UrlTokenizerFactory.class(Unknown Source)
  at org.elasticsearch.index.analysis.TokenizerFactoryFactory.create(Unknown Source)
  at org.elasticsearch.common.inject.assistedinject.FactoryProvider2.initialize(Unknown Source)
  at _unknown_

1 error
	at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:360)
	at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:172)
	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
	at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:157)
	at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:55)
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:358)
	... 9 more

My tokenizer is built for v2.3.4, and the TokenizerFactory looks like this:

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public UrlTokenizerFactory(Index index, IndexSettingsService indexSettings, @Assisted String name, @Assisted Settings settings){
        super(index, indexSettings.getSettings(), name, settings);
    }

    @Override
    public Tokenizer create() {
        return new UrlTokenizer();
    }
}

I genuinely don't know what I'm doing wrong. Have I deployed it incorrectly? It appears to be using my classes according to the logs...

I've only deployed it to one of my ES nodes (4-node cluster). The /_cat/plugins?v endpoint gives this:

name         component          version type url 
Samuel Silke urltokenizer       2.3.4.0 j        

Can anyone help? What am I doing wrong?

What are the entire contents of your tokenizer .java file (in particular, the imports)? Note that tokenizers were changed in 5.0 to be "deguiced", so they should no longer have these kinds of pains during development.

Here are the entire contents of my TokenizerFactory class (sans package):

import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
import org.elasticsearch.index.settings.IndexSettingsService;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    @Inject
    public UrlTokenizerFactory(Index index, IndexSettingsService indexSettings, @Assisted String name, @Assisted Settings settings){
        super(index, indexSettings.getSettings(), name, settings);
    }

    @Override
    public Tokenizer create() {
        return new UrlTokenizer();
    }
}

Do you need my tokenizer as well?

Remember, I'm developing for v2.3.4, not 5.0.

Thanks for any help!

How are you starting elasticsearch?

sudo service elasticsearch restart

Sorry, I don't have any other ideas. Guice is a nightmare, which is why we have been slowly removing it. You should really upgrade to 5.x :neutral_face:

Oh. :frowning:

Any chance you could give me some guidance on how I might recreate the Guice setup to see why it's failing?

A shot in the dark, but are you sure that the IndexSettingsService can be used in the constructor? Try using the index settings directly (which should be injected):

public UrlTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings)

The Guice stuff is all initialized in Node.java; at least, in there you can see which Module classes are loaded, which set up the bindings in Guice. It seems to find your class OK, but then does not find the constructor. Is the snippet you gave the entirety of your tokenizer factory class? I had first thought maybe you were using the Inject annotation from javax or something like that, which is why I asked to see the imports. But the classes in your ctor parameters seem to match up with those in other tokenizer factories. :confused:

Thanks for your help.

I tried your "shot in the dark". It compiled, and I got a slightly different error this time:

Caused by: java.lang.IllegalStateException: [index.version.created] is not present in the index settings for index with uuid: [null]
	at org.elasticsearch.Version.indexCreated(Version.java:584)
	at org.elasticsearch.index.analysis.Analysis.parseAnalysisVersion(Analysis.java:99)
	at org.elasticsearch.index.analysis.AbstractTokenizerFactory.<init>(AbstractTokenizerFactory.java:40)

I'll keep digging

Although that @IndexSettings annotation doesn't resolve for me.

Just found one here: https://github.com/codelibs/elasticsearch-analysis-kuromoji-neologd/blob/2.3.x/src/main/java/org/codelibs/elasticsearch/kuromoji/neologd/index/analysis/KuromojiTokenizerFactory.java that has Environment in the constructor. Trying that!

Well, it seems to have worked...now to see if it, you know, actually works...

Yup. It's calling my tokenizer. But now it's revealed that my tokenizer is in fact crap!

Caused by: java.lang.IndexOutOfBoundsException
	at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.append(CharTermAttributeImpl.java:131)
	at com.cameraforensics.elasticsearch.plugins.UrlTokenizer.incrementToken(UrlTokenizer.java:30)

Probably because - as there are no docs - I'm doing it wrong.

    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tokens.size()) {
            return false;
        } else {
            termAtt.setEmpty().append(tokens.get(position), position, position);
            position++;
            return true;
        }
    }

tokens is a list of all permutations of index segmentation (as per this: Performance of doc_values field vs analysed field)

I'm not really sure what the two int values should be on CharTermAttribute#append, so I'm guessing - incorrectly.
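For reference, the two int arguments to append(CharSequence s, int start, int end) follow the standard java.lang.Appendable convention: they are indices into s itself (start inclusive, end exclusive), not positions in the token list, which is why passing position for both blows up once it exceeds the token's length. StringBuilder uses the same Appendable convention, so here's a self-contained sketch of the semantics (the URL is just a made-up example):

```java
public class AppendDemo {
    public static void main(String[] args) {
        // append(CharSequence s, int start, int end) copies s[start, end):
        // the ints index into s, not into any external token list.
        StringBuilder sb = new StringBuilder();
        String token = "http://example.com";
        sb.append(token, 0, 4);               // copies "http"
        sb.append(token, 7, token.length());  // copies "example.com"
        System.out.println(sb);               // prints: httpexample.com

        // To emit a whole token, skip the index arguments entirely:
        sb.setLength(0);                      // like CharTermAttribute#setEmpty
        sb.append(token);
        System.out.println(sb);               // prints: http://example.com
    }
}
```

So in incrementToken() the safe call is presumably just termAtt.setEmpty().append(tokens.get(position)), with no index arguments at all.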

Anyway, thanks for all of your help. I'll keep hacking!

Nailed it. Thanks again!

PS: If you want to continue with me on my Journey of Pain: Custom tokenizer doesn't work on reindex/index api, only _analyze endpoint :wink:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.