zouxiang
(zouxiang)
August 6, 2018, 11:54am
1
Hello, I'm trying to find a token filter that can combine all the tokens into one.
For example:
the text "bags and shoes" ==> 3 tokens: "bags" "and" "shoes" (using StandardTokenizer)
"bags" "and" "shoes" ==> "bag" "and" "shoe" (using the Porter Stem Token Filter)
Is there a token filter that can then combine "bag", "and", "shoe" into one token: "bag and shoe"?
Or is there any other way to analyze the text "bags and shoes" and get the single keyword "bag and shoe" as the result?
dadoonet
(David Pilato)
August 6, 2018, 12:05pm
2
zouxiang
(zouxiang)
August 6, 2018, 12:43pm
3
Thanks.
I tried using the shingle filter like this:
GET _analyze
{
  "text": ["bags and shoes"],
  "tokenizer": "standard",
  "filter": [
    "porter_stem",
    {
      "type": "shingle",
      "output_unigrams": false,
      "min_shingle_size": 3,
      "max_shingle_size": 3
    }
  ]
}
and got this result:
{
  "tokens": [
    {
      "token": "bag and shoe",
      "start_offset": 0,
      "end_offset": 14,
      "type": "shingle",
      "position": 0
    }
  ]
}
This is the result I want, but it only works when
min_shingle_size == max_shingle_size == number of tokens
The number of tokens is not known in advance and varies from one text to another, so I can't pick fixed values for min_shingle_size and max_shingle_size.
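To illustrate the limitation, here is a minimal sketch that sends a four-token text through the same filter chain. The input "red bags and shoes" is a made-up example. With both sizes fixed at 3, the shingle filter should emit two overlapping shingles ("red bag and" and "bag and shoe") instead of one combined token.

```
# Hypothetical four-token input; the filter chain is the same as in the request above
GET _analyze
{
  "text": ["red bags and shoes"],
  "tokenizer": "standard",
  "filter": [
    "porter_stem",
    {
      "type": "shingle",
      "output_unigrams": false,
      "min_shingle_size": 3,
      "max_shingle_size": 3
    }
  ]
}
```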
dadoonet
(David Pilato)
August 6, 2018, 12:45pm
4
Why not index the full content as one single token, with a "keyword" type for example, and add a subfield that indexes every term separately?
See https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
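A minimal sketch of such a mapping, assuming Elasticsearch 7.x or later (typeless mappings; on 6.x the properties would sit under a mapping type name). The index name keyword_demo, the field title, the subfield terms, and the analyzer name stemmed are all hypothetical.

```
# Hypothetical index and field names; assumes 7.x+ typeless mapping syntax
PUT keyword_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "fields": {
          "terms": {
            "type": "text",
            "analyzer": "stemmed"
          }
        }
      }
    }
  }
}
```

Here title holds the whole value as one unanalyzed term, while title.terms holds the individual stemmed terms.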
zouxiang
(zouxiang)
August 6, 2018, 12:54pm
5
If I index it as a single "keyword" token, the porter_stem filter does not work correctly:
GET _analyze
{
  "text": ["bags and shoes"],
  "tokenizer": "keyword",
  "filter": [
    "porter_stem"
  ]
}
the result:
{
  "tokens": [
    {
      "token": "bags and sho",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
The token is "bags and sho", not "bag and shoe", because the stemmer treats the whole string as a single word.
dadoonet
(David Pilato)
August 6, 2018, 1:38pm
6
Yes. That's true.
BTW, why do you want to do this? You wrote:
"I'm trying to find a token filter that can combine all the tokens into one."
Why not use a phrase search?
zouxiang
(zouxiang)
August 6, 2018, 2:21pm
7
Because I want to match the whole content of this field, not just part of it.
The document "bags and shoes" should be returned only when I search for "bags and shoes" or "bag and shoe",
and not when I search for "bag", "bag and", or "and shoe".
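One way to approximate this behaviour without a custom filter is to combine a phrase query with a token count check: the phrase must match, and the field must contain exactly as many tokens as the query. This is only a sketch under assumptions, not something confirmed in this thread. It assumes Elasticsearch 7.x or later mapping syntax, and the index name exact_phrase_demo, the field title, the subfield length, and the analyzer name stemmed are hypothetical. The client has to work out the number of query tokens itself (3 here).

```
# Hypothetical names throughout; title.length stores the number of tokens in title
PUT exact_phrase_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "stemmed",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

PUT exact_phrase_demo/_doc/1
{
  "title": "bags and shoes"
}

# The phrase must match AND the field must have exactly 3 tokens
GET exact_phrase_demo/_search
{
  "query": {
    "bool": {
      "must": { "match_phrase": { "title": "bag and shoe" } },
      "filter": { "term": { "title.length": 3 } }
    }
  }
}
```

With this setup a search for "bag and" would need a count of 2 and so should not match the three-token document, while a longer document would fail the count check for a three-token query.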
dadoonet
(David Pilato)
August 6, 2018, 2:50pm
8
And if there is a document like "bags and shoes and whatever" it should not be returned either when you search for "bags and shoes", right?
dadoonet
(David Pilato)
August 6, 2018, 3:06pm
10
So I don't know. The problem is that you also want to apply a stemmer to all terms.
Maybe @jpountz has an idea?
zouxiang
(zouxiang)
August 6, 2018, 3:17pm
11
Thank you anyway. Maybe I should add a plugin and implement a custom token filter.
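Before writing a plugin, the built-in fingerprint token filter may be worth a look, since it concatenates all tokens of a stream into a single token. It also sorts the tokens and removes duplicates, so word order is lost: the request below should yield the single token "and bag shoe" rather than "bag and shoe", and a search for "shoes and bags" would produce the same token. Whether that trade-off is acceptable depends on the use case.

```
# Sketch only: fingerprint sorts and de-duplicates tokens before joining them
GET _analyze
{
  "text": ["bags and shoes"],
  "tokenizer": "standard",
  "filter": [
    "porter_stem",
    "fingerprint"
  ]
}
```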