I've written a custom tokenizer. It was hard going (see my earlier question, Building a custom tokenizer: "Could not find suitable constructor"), but I persevered.
I created an index with the tokenizer configured like this:
"analysis":{
"analyzer":{
"urlanalyzer":{
"type": "custom",
"tokenizer": "urltokenizer",
"filter": "urlstopwords"
}
},
"filter": {
"urlstopwords": {
"type": "stop",
"stopwords": ["http:", "https:"]
}
}
}
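(For context, this block lives under "settings" in the index-creation request; stripped down to just the analysis part, the call looks roughly like:)

PUT /myindex
{
  "settings": {
    "analysis": {
      ... the block above ...
    }
  }
}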
When I run the _analyze API against it, everything appears to work. For example:
POST /myindex/_analyze
{
"text": "https://www.domain.com/one/two/three/four.jpg",
"tokenizer": "urltokenizer"
}
Response:
{
"tokens": [
{
"token": "https://www.domain.com/one/two/three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "www.domain.com/one",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "www.domain.com/one/two",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 3
},
{
"token": "www.domain.com/one/two/three",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 4
},
{
"token": "one",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 5
},
{
"token": "https://www.domain.com/one",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 6
},
{
"token": "https://www.domain.com/one/two/three",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 7
},
{
"token": "one/two",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 8
},
{
"token": "two",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 9
},
{
"token": "three",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 10
},
{
"token": "https://www.domain.com/one/two",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 11
},
{
"token": "www.domain.com/one/two/three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 12
},
{
"token": "one/two/three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 13
},
{
"token": "two/three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 14
},
{
"token": "/one/two/three/four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 15
},
{
"token": "two/three",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 16
},
{
"token": "one/two/three",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 17
},
{
"token": "four.jpg",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 18
},
{
"token": "www.domain.com",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 19
}
]
}
(As an aside: should the start_offset and end_offset values all be 0?)
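From what I can see in the Lucene javadocs, start_offset and end_offset normally come from an OffsetAttribute (org.apache.lucene.analysis.tokenattributes) that the tokenizer populates itself, which mine doesn't do yet. A minimal sketch of what I think that would look like (the names are illustrative, this isn't code I actually have):

private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

// inside incrementToken(), once the token's start index in the original input is known:
int start = stringToTokenize.indexOf(token); // naive way to locate the token in the input
offsetAtt.setOffset(correctOffset(start), correctOffset(start + token.length()));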
If I check the term vectors for a test document, it comes back with these same terms.
However, if I index a document into an index that uses the analyzer with the following mappings:
...
"source": {
  "type": "string",
  "index": "not_analyzed",
  "copy_to": "filename"
},
"url": {
  "type": "string",
  "index": "not_analyzed",
  "copy_to": "filename"
},
"filename": {
  "type": "string",
  "analyzer": "urlanalyzer",
  "store": false
},
...
it comes back with no terms, and there are no errors in the logs.
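One thing I notice is that my _analyze test above only exercised the bare tokenizer. As far as I understand, passing the named analyzer instead would also run the stop filter, i.e. something like:

POST /myindex/_analyze
{
  "text": "https://www.domain.com/one/two/three/four.jpg",
  "analyzer": "urlanalyzer"
}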
Also, if I try to use the Reindex API to copy data from another index into this one, every document fails with the same error:
"cause": {
"type": "illegal_argument_exception",
"reason": "first position increment must be > 0 (got 0) for field 'filename'"
},
"status": 400
Is it because I'm calling CharTermAttribute#append with the wrong parameters? Here is the relevant part of my tokenizer:
public class UrlTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    protected List<String> tokens = new ArrayList<>();
    protected String stringToTokenize;
    protected int position = 0;

    ...

    // Emits the next pre-computed token, if there is one.
    @Override
    public boolean incrementToken() throws IOException {
        if (position >= tokens.size()) {
            return false;
        } else {
            String token = tokens.get(position);
            termAtt.setEmpty().append(token, stringToTokenize.indexOf(token), token.length());
            position++;
            return true;
        }
    }

    final char[] buffer = new char[8192];

    // Reads the whole input into stringToTokenize and pre-computes the token list.
    private void fillBuffer(Reader input) throws IOException {
        int len;
        StringBuilder str = new StringBuilder();
        str.setLength(0);
        while ((len = input.read(buffer)) > 0) {
            str.append(buffer, 0, len);
        }
        stringToTokenize = str.toString();
        tokens = splitUrl(stringToTokenize); // Method to split into segmentation permutations
    }
}
Can anyone tell me what I'm doing wrong?
Thanks for any help at all!