How to combine all tokens into one?

Hello, I'm trying to find a token filter that can combine all the tokens into one.

For example:

  1. the text "bags and shoes" ==> 3 tokens: "bags", "and", "shoes" (using the standard tokenizer)
  2. "bags", "and", "shoes" ==> "bag", "and", "shoe" (using the porter_stem token filter)

Then, is there a token filter that can combine "bag", "and", "shoe" into one token: "bag and shoe"?

Or is there any way to analyze the text "bags and shoes" and get a single keyword result "bag and shoe"?


Have a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html

Might be what you are looking for.

Thanks.
I tried to use the shingle filter like this:

GET _analyze
{
  "text": ["bags and shoes"],
  "tokenizer": "standard",
  "filter": [
    "porter_stem",
    {
      "type": "shingle",
      "output_unigrams": false,
      "min_shingle_size": 3,
      "max_shingle_size": 3
    }
  ]
}

and got this result:

{
  "tokens": [
    {
      "token": "bag and shoe",
      "start_offset": 0,
      "end_offset": 14,
      "type": "shingle",
      "position": 0
    }
  ]
}

This is the result I want. But this works only when

min_shingle_size == max_shingle_size == token count

The token count is not fixed and differs from text to text, so I can't determine values for min_shingle_size and max_shingle_size in advance.
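
For example (a sketch of the failure case, using a made-up four-token text), the same request on "red bags and shoes" produces the two 3-word shingles "red bag and" and "bag and shoe", and neither of them is the whole field:

GET _analyze
{
  "text": ["red bags and shoes"],
  "tokenizer": "standard",
  "filter": [
    "porter_stem",
    {
      "type": "shingle",
      "output_unigrams": false,
      "min_shingle_size": 3,
      "max_shingle_size": 3
    }
  ]
}

And with a text of fewer than 3 tokens, this request emits no tokens at all.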

Why not index the full content as one single token, with a "keyword" type for example, and add a subfield which indexes every single term alone?

See https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
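
Something like this, for example (the index and field names are made up):

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "fields": {
          "terms": {
            "type": "text"
          }
        }
      }
    }
  }
}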

If I index with a "keyword" type, the porter_stem filter doesn't work correctly:

GET _analyze
{
  "text": ["bags and shoes"],
  "tokenizer": "keyword",
  "filter": [
    "porter_stem"
  ]
}

The result:

{
  "tokens": [
    {
      "token": "bags and sho",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
} 

The token is "bags and sho", not "bag and shoe": the stemmer sees the whole field as one token, so it only strips a suffix from the very end.

Yes. That's true.

BTW why do you want to do this?

I'm trying to find a token filter that can combine all the tokens into one.

Why not use a phrase search?

Because I want to match all of the words in this field, not just part of it.
The document "bags and shoes" should be returned only when I search for "bags and shoes" or "bag and shoe",
and not when I search for "bag", "bag and", or "and shoe".

And if there is a document like "bags and shoes and whatever" it should not be returned either when you search for "bags and shoes", right?

Yes.
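
A phrase query (a sketch, with a made-up index name my_index) matches every document that contains the phrase, not only documents that are exactly the phrase:

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "bags and shoes"
    }
  }
}

so both "bags and shoes" and "bags and shoes and whatever" would be returned.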

So I don't know. The problem is that you also want to apply a stemmer to all terms.

Maybe @jpountz has an idea?

Thank you anyway. Maybe I should add a plugin and implement a custom token filter.
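
Something like this Lucene token filter might work (a rough sketch only: the class name ConcatenateAllFilter is made up, and the TokenFilterFactory plus plugin registration that Elasticsearch needs are omitted):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Emits a single token that is the space-joined concatenation of all upstream tokens.
public final class ConcatenateAllFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean done = false;

  public ConcatenateAllFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    done = true;

    StringBuilder joined = new StringBuilder();
    int startOffset = 0;
    int endOffset = 0;
    boolean sawToken = false;

    // Drain the upstream stream (e.g. standard tokenizer + porter_stem),
    // joining every token with a single space.
    while (input.incrementToken()) {
      if (!sawToken) {
        startOffset = offsetAtt.startOffset();
        sawToken = true;
      } else {
        joined.append(' ');
      }
      joined.append(termAtt.buffer(), 0, termAtt.length());
      endOffset = offsetAtt.endOffset();
    }
    if (!sawToken) {
      return false; // empty input: emit nothing
    }

    clearAttributes();
    termAtt.setEmpty().append(joined);
    offsetAtt.setOffset(startOffset, endOffset);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}

With this at the end of the chain after porter_stem, "bags and shoes" would be indexed as the single term "bag and shoe".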

