Issues using type_table in a word_delimiter token filter to NOT split on special character

Jacob_Evans · August 6, 2012, 10:44pm

I'm trying to design a token filter that will split words, but will not
split on dollar signs ("$"). (I'm actually trying to do something more
complicated, but I am stuck on this step).

I'm attempting to use a word_delimiter filter to do this, but the "type_table"
item in the JSON configuration does not seem to be honored by
Elasticsearch. Below is how I am creating the index:

curl -X POST http://localhost:9200/venues -d '{
  "mappings": {
    "venue": {
      "properties": {
        "id": {
          "type": "string"
        },
        "name": {
          "type": "string",
          "analyzer": "name_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "split_words_except_dollar"
          ]
        }
      },
      "filter": {
        "split_words_except_dollar": {
          "type": "word_delimiter",
          "type_table": {
            "$": "ALPHANUM"
          }
        }
      }
    }
  }
}'

(Note: I have also tried it with type_table as an array of objects,
rather than a single object, with the same result.)

I am testing it with the following call:

curl -XGET 'localhost:9200/venues/_analyze?pretty=true&analyzer=name_analyzer' -d 'ke$ha'

I expect the output of the analyzer to be "ke$ha", but instead it emits
two tokens: "ke" & "ha".
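
To make the expected and actual behavior concrete, here is a deliberately simplified Python model of the delimiter-splitting step. It is only a sketch: the function name and the WORD_TYPES set are made up for illustration, it ignores the filter's case-change and letter/digit splitting, and it is not how Elasticsearch or Lucene implement word_delimiter.

```python
# Types that would make a character count as part of a word in this toy model.
WORD_TYPES = {"LOWER", "UPPER", "ALPHA", "DIGIT", "ALPHANUM"}


def split_on_delimiters(token, type_table=None):
    """Split a token on non-alphanumeric characters.

    Characters remapped to a word type via type_table are kept
    inside the surrounding sub-token instead of splitting it.
    """
    type_table = type_table or {}
    parts, current = [], []
    for ch in token:
        if ch.isalnum() or type_table.get(ch) in WORD_TYPES:
            current.append(ch)
        elif current:
            # Hit a delimiter: close the sub-token collected so far.
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts


# What I am seeing: "$" acts as a delimiter.
print(split_on_delimiters("ke$ha"))
# What I expect once "$" is remapped to a word type.
print(split_on_delimiters("ke$ha", {"$": "ALPHANUM"}))
```

The first call models the two tokens I get today; the second models the single token I would expect if the remapping were honored.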

What am I doing wrong here? I can't find an example anywhere of a
custom type_table or of its correct syntax.

Thank you,
Jacob

Jacob_Evans · August 6, 2012, 11:18pm

Okay, so after digging through the ES source, I realize my approach was
slightly off. I need to define my type_table as an array of strings. I've
done this now (see below), but I'm still getting the same results.

Note that I am now confident I'm on the right track: if my
type_table[0] string is malformed, e.g. "$ => SOMEINVALIDTYPE", ES
throws a runtime error (from parseTypes() on line 114 of
src/main/java/org/elasticsearch/index/analysis/WordDelimiterTokenFilterFactory.java)
when I try to create the index, but no error is thrown with the config
pasted below.

curl -X POST http://localhost:9200/venues -d '{
  "mappings": {
    "venue": {
      "properties": {
        "id": {
          "type": "string"
        },
        "name": {
          "type": "string",
          "analyzer": "name_analyzer"
        },
        "bounds": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "split_words_except_dollar"
          ]
        }
      },
      "filter": {
        "split_words_except_dollar": {
          "type": "word_delimiter",
          "type_table": [
            "$ => LOWER"
          ]
        }
      }
    }
  }
}'

So, I'm closer but still stuck. Any ideas?
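
For anyone else looking for the syntax: each type_table entry is a string of the form "character => TYPE". The Python below is a rough sketch of how such rules could be validated. It is not the Java code in ES, the function name is hypothetical, and the list of type names is the set accepted by Lucene's word delimiter factory as far as I can tell.

```python
# Type names as accepted by Lucene's word delimiter factory (to my knowledge).
VALID_TYPES = {"LOWER", "UPPER", "ALPHA", "DIGIT", "ALPHANUM", "SUBWORD_DELIM"}


def parse_type_table(rules):
    """Turn ["$ => LOWER", ...] into {"$": "LOWER", ...}.

    Raises ValueError for a malformed rule or an unknown type name,
    mirroring the kind of error seen at index-creation time.
    """
    table = {}
    for rule in rules:
        lhs, arrow, rhs = rule.partition("=>")
        char, type_name = lhs.strip(), rhs.strip()
        if not arrow or len(char) != 1:
            raise ValueError("Invalid mapping rule: [%s]" % rule)
        if type_name not in VALID_TYPES:
            raise ValueError("Invalid type in rule: [%s]" % rule)
        table[char] = type_name
    return table


print(parse_type_table(["$ => LOWER"]))
```

With this sketch, "$ => SOMEINVALIDTYPE" raises an error while "$ => LOWER" parses cleanly, which matches what I see when creating the index.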

Ivan · August 6, 2012, 11:50pm

Have you looked into a pattern tokenizer?

For your simple case, you could capture only word characters and dollar signs:

index :
    analysis :
        tokenizer :
            customtokenizer :
                type : pattern
                group : 0
                pattern : '([\w$]+)'
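
As a quick illustration of what a pattern like this matches, here is a small Python sketch. It assumes the intended character class is [\w$]+ (word characters plus the dollar sign) and uses Python's re module rather than the Java regex engine ES runs, so treat it as an approximation of the tokenizer's behavior with group 0, where each whole match becomes a token.

```python
import re

# Assumed pattern: one or more word characters or dollar signs.
TOKEN_PATTERN = re.compile(r"[\w$]+")


def pattern_tokenize(text):
    """Return every whole match (group 0) of the pattern as a token."""
    return [match.group(0) for match in TOKEN_PATTERN.finditer(text)]


print(pattern_tokenize("ke$ha is bad"))
```

The dollar sign stays inside the token because it is part of the character class, while the spaces fall outside it and separate the matches.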

--
Ivan

Jacob_Evans · August 7, 2012, 12:03am

Yes, I have. The reason I can't use it (and maybe I should have been more
clear) is that my example posted above is a bit simpler than my actual
scenario.

In actuality, this filter will come after a DIFFERENT word_delimiter token
filter, which concatenates all of the input but also has
preserve_original=true.

I'm not really trying to split on '$', it was just to make the example
easier to understand. I'm really splitting on ' ' (spaces).

So, basically I want to have tokens for both the entire concatenated
string, as well as split by spaces. That is, given the input "ke$ha is
bad", I want the following tokens: "kehaisbad" ($ removed by first w_d
filter), "ke$ha", "is", "bad"
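
To pin down the output I am after, here is a short Python sketch of the desired result. It only models the end state, not the ES filter chain, and the function name is made up for the example.

```python
import re


def desired_tokens(text):
    """Return the concatenated form followed by the space-separated parts."""
    # Concatenated form: everything except letters and digits is dropped.
    concatenated = re.sub(r"[^A-Za-z0-9]", "", text)
    tokens = [concatenated]
    # Space-separated parts, with special characters such as "$" preserved.
    for part in text.split():
        if part not in tokens:
            tokens.append(part)
    return tokens


print(desired_tokens("ke$ha is bad"))
```

The duplicate check keeps a single-word input without special characters from producing the same token twice.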

Unfortunately, there's no whitespace token filter that's
analogous to the whitespace tokenizer. Also, the pattern_replace token
filter can't SPLIT tokens the way the pattern tokenizer does
(it can only search and replace).

Make sense?

Jacob_Evans · August 7, 2012, 12:09am

FYI, I actually think that this is a bug. See issue #2145 in
elastic/elasticsearch on GitHub: word_delimiter token filter does not
honor "type_table" option.
