Custom tokenizer doesn't work on the reindex/index APIs, only the _analyze endpoint

I've written a custom tokenizer. It was hard (see my earlier thread, Building a custom tokenizer: "Could not find suitable constructor"), but I persevered.

I created an index with it configured:

      "analysis":{
            "analyzer":{
                "urlanalyzer":{ 
                    "type": "custom",
                    "tokenizer": "urltokenizer",
                    "filter": "urlstopwords"
                }
            },
            "filter": {
                "urlstopwords": {
                    "type": "stop",
                    "stopwords": ["http:", "https:"]
                }
            }
        }

When I run the _analyze tester against it, it all works, for example:

POST /myindex/_analyze
{
    "text": "https://www.domain.com/one/two/three/four.jpg",
    "tokenizer": "urltokenizer"
}

Response:

{
    "tokens": [
        {
            "token": "https://www.domain.com/one/two/three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 0
        },
        {
            "token": "www.domain.com/one",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 1
        },
        {
            "token": "www.domain.com/one/two",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 2
        },
        {
            "token": "three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 3
        },
        {
            "token": "www.domain.com/one/two/three",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 4
        },
        {
            "token": "one",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 5
        },
        {
            "token": "https://www.domain.com/one",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 6
        },
        {
            "token": "https://www.domain.com/one/two/three",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 7
        },
        {
            "token": "one/two",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 8
        },
        {
            "token": "two",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 9
        },
        {
            "token": "three",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 10
        },
        {
            "token": "https://www.domain.com/one/two",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 11
        },
        {
            "token": "www.domain.com/one/two/three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 12
        },
        {
            "token": "one/two/three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 13
        },
        {
            "token": "two/three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 14
        },
        {
            "token": "/one/two/three/four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 15
        },
        {
            "token": "two/three",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 16
        },
        {
            "token": "one/two/three",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 17
        },
        {
            "token": "four.jpg",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 18
        },
        {
            "token": "www.domain.com",
            "start_offset": 0,
            "end_offset": 0,
            "type": "word",
            "position": 19
        }
    ]
}

(Should the offsets all be 0?)

If I check the term vectors using a test document, it comes back with these terms.

However, if I index a document into an index that uses the analyzer with the following mappings:

...
            "source": {
                "type": "string",
                "index": "not_analyzed",
                "copy_to": "filename"
            },
            "url": {
                "type": "string",
                "index": "not_analyzed",
                "copy_to": "filename"
            },
            "filename": {
                "type": "string",
                "analyzer": "urlanalyzer",
                "store": false
            },
...

the term vectors come back with no terms, and there are no errors in the logs.

Also, if I try to use the reindex API to move data from another index into this one, every document fails with the same error:

            "cause": {
                "type": "illegal_argument_exception",
                "reason": "first position increment must be > 0 (got 0) for field 'filename'"
            },
            "status": 400

Is it because I'm calling CharTermAttribute#append with the wrong parameters? Here is the relevant part of my tokenizer:

public class UrlTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    protected List<String> tokens = new ArrayList<>();

    protected String stringToTokenize;

    protected int position = 0;
    ...

    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tokens.size()) {
            return false;
        } else {
            String token = tokens.get(position);
            termAtt.setEmpty().append(token, stringToTokenize.indexOf(token), token.length());
            position++;
            return true;
        }
    }

    final char[] buffer = new char[8192];
    private void fillBuffer(Reader input) throws IOException {
        int len;
        StringBuilder str = new StringBuilder();
        str.setLength(0);
        while ((len = input.read(buffer)) > 0) {
            str.append(buffer, 0, len);
        }
        stringToTokenize = str.toString();
        tokens = splitUrl(stringToTokenize); //Method to split into segmentation permutations
    }

}

Can anyone tell me what I'm doing wrong?

Thanks for any help at all!

Here is the full class in case that helps (be gentle - I've been hacking this for days trying to use an undocumented API):

import java.io.IOException;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class UrlTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    protected List<String> tokens = new ArrayList<>();

    protected String stringToTokenize;

    protected int position = 0;

    public UrlTokenizer() {
        super(DEFAULT_TOKEN_ATTRIBUTE_FACTORY);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tokens.size()) {
            return false;
        } else {
            String token = tokens.get(position);
            termAtt.setEmpty().append(token, stringToTokenize.indexOf(token), token.length());
            position++;
            return true;
        }
    }

    // Splits a URL into its host, query parameters, full path, individual path segments,
    // and contiguous path-segment combinations (with and without the scheme/host prefix).
    private List<String> splitUrl(String url) throws MalformedURLException {
        URL realUrl = new URL(url);
        Set<String> parts = new HashSet<>();
        String host = realUrl.getHost();
        parts.add(host);

        String query = realUrl.getQuery();
        if (query != null && !"".equals(query.trim())) {
            parts.addAll(Arrays.asList(realUrl.getQuery().split("&")));
            parts.add(realUrl.getQuery());
            parts.add(url.substring(0, url.length() - (query.length() + 1)));
        }

        String path = realUrl.getPath();
        if (path != null && !"".equals(path.trim())) {
            parts.add(realUrl.getPath());
            String[] pathSegments = realUrl.getPath().substring(1).split("/");

            int idx = url.indexOf(realUrl.getHost() + "/") + (realUrl.getHost() + "/").length();
            String upToHost = url.substring(0, idx);

            for (int i = 0; i < pathSegments.length; i++) {
                parts.add(pathSegments[i]);
                for (int j = i+1; j <= pathSegments.length; j++) {
                    String permutation = join(Arrays.copyOfRange(pathSegments, i, j), "/"); // + (j == pathSegments.length ? "" : "/");
                    parts.add(/* "/" + */permutation);
                    if (i == 0) {
                        parts.add(upToHost + permutation);
                        parts.add(host + "/" + permutation);
                    }

                }

            }


        }

        return new ArrayList<>(parts);
    }

    public String join(String[] values, String delimiter) {
        StringBuffer strbuf = new StringBuffer();

        boolean first = true;

        for (String value : values) {
            if (!first) { strbuf.append(delimiter); } else { first = false; }
            strbuf.append(value);
        }

        return strbuf.toString();
    }
//
//    @Override
//    public void end() throws IOException {
//        super.end();
//        final int ofs = correctOffset(tokens.size());
//        offsetAtt.setOffset(ofs, ofs);
//    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            tokens.clear();
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        fillBuffer(input);
        position = 0;
    }

    final char[] buffer = new char[8192];
    private void fillBuffer(Reader input) throws IOException {
        int len;
        StringBuilder str = new StringBuilder();
        str.setLength(0);
        while ((len = input.read(buffer)) > 0) {
            str.append(buffer, 0, len);
        }
        stringToTokenize = str.toString();
        tokens = splitUrl(stringToTokenize);
    }

}

A couple of things I notice:

  • You don't need to call the append method that takes start/end, since you already have the entire token.
  • You need to call clearAttributes() at the beginning of incrementToken(), before setting any attribute values (see the sketch after this list).
  • Your offsets are all 0 because you don't have a start/end offset attribute.
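
For illustration, here is a minimal sketch of incrementToken() with the first two points applied. It reuses the position, tokens, and termAtt fields from the class above; offsets are discussed further down.

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();                 // reset all attributes on every call, before setting new values
        if (position >= tokens.size()) {
            return false;
        }
        String token = tokens.get(position);
        termAtt.setEmpty().append(token);  // append the whole token; no start/end arguments needed
        position++;
        return true;
    }

Clearing the attributes on every call also matters because Lucene reuses the same tokenizer instance across field values and documents, and TokenStream.end() sets the position-increment attribute to 0; without clearAttributes() that stale 0 is still there for the next value's first token, which is likely where the "first position increment must be > 0 (got 0)" error comes from.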

Awesome - thanks!
So, do I even need to set start/end offset attributes?

I have tried adding this:

    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
...
            int startOffset = stringToTokenize.indexOf(token);
            offsetAtt.setOffset(startOffset, startOffset + token.length());

but am getting this error:

Caused by: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=29

I think this is because the field I'm tokenizing is populated by copy_to from two other fields, and indexOf is returning -1 because it's looking for a token from one field in the other field's content?

Also, just to confirm - do you mean to say that I should call clearAttributes() every time incrementToken is called? Or just the first time?

Thanks again!

Every time. See, for example, PatternTokenizer in Lucene, and also the Javadoc on Tokenizer.

do I even need to set start/end offset attributes?

It depends. If you don't need offsets (for example, to do highlighting), then no.

I think this is because the field I'm tokenizing is populated by copy_to from two other fields, and indexOf is returning -1 because it's looking for a token from one field in the other field's content?

I'm not sure what you mean by this. The offsets should be based on the data coming into the tokenizer. A copy_to copies the data and is handled outside the analyzer, so a tokenizer should see no difference between values passed directly to a field and values copied from another field. I also wouldn't use indexOf anyway: it's a linear-time lookup. If you want offsets, keep track of them while doing your splitting.
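
For example (a sketch only, not your actual splitUrl logic), you could record the offsets alongside each token while splitting, using a small hypothetical holder class, and then set the OffsetAttribute from it:

    // Hypothetical holder pairing a token with the offsets it was cut from.
    private static class TokenSpan {
        final String text;
        final int start;   // start offset in the original input
        final int end;     // end offset in the original input

        TokenSpan(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Sketch: split on '/' while tracking offsets, instead of calling indexOf afterwards.
    private List<TokenSpan> splitWithOffsets(String input) {
        List<TokenSpan> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == '/') {
                if (i > start) {
                    spans.add(new TokenSpan(input.substring(start, i), start, i));
                }
                start = i + 1;
            }
        }
        return spans;
    }

    // Then, in incrementToken(), after clearAttributes():
    //     TokenSpan span = spans.get(position);
    //     termAtt.setEmpty().append(span.text);
    //     offsetAtt.setOffset(correctOffset(span.start), correctOffset(span.end));

The correctOffset() calls only matter if a CharFilter ever sits in front of the tokenizer (they map offsets back to the original text), but they are cheap to call unconditionally.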


Great advice. Thank you!

So, I reindexed 10M documents into an index that uses my tokenizer. Unfortunately, only the url field's terms made it across, not the source field's terms.

Both fields are copied into the filename field via copy_to:

...
            "source": {
                "type": "string",
                "index": "not_analyzed",
                "copy_to": "filename"
            },
            "url": {
                "type": "string",
                "index": "not_analyzed",
                "copy_to": "filename"
            },
            "filename": {
                "type": "string",
                "analyzer": "urlanalyzer",
                "store": false
            },
...

I'm getting the string to tokenize in the reset method:

    @Override
    public void reset() throws IOException {
        super.reset();
        fillBuffer(input);
        position = 0;
    }

    final char[] buffer = new char[8192];
    private void fillBuffer(Reader input) throws IOException {
        int len;
        StringBuilder str = new StringBuilder();
        str.setLength(0);
        while ((len = input.read(buffer)) > 0) {
            str.append(buffer, 0, len);
        }
        stringToTokenize = str.toString();
        tokens = splitUrl(stringToTokenize);
    }

and clearing the tokens in the close method:

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            tokens.clear();
        }
    }

but I'm not really doing anything with end().
My guess is that I need to clear the tokens in end() rather than close(), or set position back to 0 in close() rather than in reset().

Man, I really wish I understood how this worked!

You should read the Lucene docs for analysis components.

And probably also the TokenStream docs, which Tokenizer extends.
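
As a rough illustration of the workflow those docs describe, this is approximately how a consumer drives any Tokenizer (a sketch assuming the UrlTokenizer above and a java.io.StringReader). It shows why per-document setup belongs in reset(), end-of-stream bookkeeping in end(), and only resource cleanup in close():

    void dumpTokens(String text) throws IOException {
        try (Tokenizer tokenizer = new UrlTokenizer()) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);
            tokenizer.reset();                    // per-document setup: this is where your fillBuffer() runs
            while (tokenizer.incrementToken()) {  // one call per token; clearAttributes() belongs inside
                System.out.println(term);
            }
            tokenizer.end();                      // end-of-stream state, e.g. the final offset
        }                                         // close() releases resources only
    }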


That's really great - thanks!

It turns out that I had created the index with an old template, so the source field wasn't being copied via copy_to after all. D'oh!

Thanks again for your help.
