Can we do: Analyzer -> Tokenizer -> Token Filter -> Re-tokenize, and consider only the final tokens?

Hello,

Given the input text:

{
  "analyzer": "parapheur_shingle",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}

I would like the following steps:

  1. tokenize as standard (word break on spaces => 8 tokens)
  2. filter each token ("lowercase", "pattern_replace"); "pattern_replace" replaces "." with " "
  3. so we obtain the token "4 8 4 1" after applying pattern_replace to the original token "4.8.4.1"
  4. redo step 1 => 8 + 5 more tokens
  5. filter each token with the shingle filter (this filter depends on the surrounding tokens, so its output here, after re-tokenizing, differs from what it would produce right after step 3)
  6. end of token generation
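To make step 4 concrete: the two dotted tokens "4.8.4.1" and "4.8.4" are re-split into 4 and 3 tokens respectively, so the 8 original tokens become 13. After lowercasing (step 2), the stream going into the shingle filter would be:

```json
["w", "4", "8", "4", "1", "ni", "fna", "np", "4", "8", "4", "test", "2"]
```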

Thank you.

I managed to do what I want with the "simple_pattern_split" tokenizer:

{
  "index": {
    "max_ngram_diff": 50,
    "analysis": {
      "analyzer": {
        "parapheur_shingle_new": {
          "tokenizer": "pattern_split_new",
          "filter": [ "shingle" ]
        }
      },
	  "tokenizer": {
	    "pattern_split_new": {
	      "type": "simple_pattern_split",
	      "pattern": "[\\s+ \\.]"
	    }
	  }
    }
  }
}
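To check the output, the analyzer can be exercised with _analyze (assuming the settings above were created on an index named, say, teste_split; the index name is just for illustration):

```json
GET teste_split/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}
```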

But I'd like the split to be the same as what the "standard" tokenizer does, instead of my hard-coded pattern.
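Maybe a "pattern_replace" character filter could help here: character filters run before the tokenizer, so replacing "." with " " at that stage would let the "standard" tokenizer do the word breaking. A sketch, not tested (the names "dot_to_space" and "parapheur_shingle_std" are mine, not from any existing setup):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dot_to_space": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      },
      "analyzer": {
        "parapheur_shingle_std": {
          "char_filter": [ "dot_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        }
      }
    }
  }
}
```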

Hi @fraf

Do you want the tokens that way?

Yes, and even more tokens. I want max_shingle_size to be infinite; that is to say, the longest token is the whole input text (with dots replaced by spaces).
It should be possible, given this error message from the server:

In Shingle TokenFilter the difference between max_shingle_size and min_shingle_size (and +1 if outputting unigrams) must be less than or equal to: [3] but was [19]. This limit can be set by changing the [index.max_shingle_diff] index level setting.
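Following that message, the limit can be raised in the index settings, next to where I already set max_ngram_diff (the value 20 here is just an example):

```json
{
  "index": {
    "max_shingle_diff": 20
  }
}
```

and then max_shingle_size in the shingle filter could be set accordingly.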

Thx.

This is the analyzer that I used:

PUT teste
{
  "settings": {
    "analysis": {
      "analyzer": {
        "parapheur_shingle_new": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_filter",
            "shingle_filter"
          ]
        }
      },
      "filter": {
        "my_filter":{
          "type": "pattern_replace",
          "pattern": """\.""",
          "replacement": " "
        },
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4
        }
      }
    }
  }
}

GET teste/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}

Thanks, but be careful: that's not exactly the same.

Given the input text "NP 4.8.1", you won't be able to match the search input "NP 4".
With the standard tokenizer, you obtain 2 tokens: "NP" and "4.8.1".
On these two tokens, you apply your filters:
"NP" => unchanged => "NP"
"4.8.1" => my_filter => "4 8 1"
shingle_filter => it shingles two adjacent tokens, so it generates the new token "NP 4 8 1".

So you have three tokens in all: "NP", "4 8 1" and "NP 4 8 1"... the "NP 4" token is missing! So no matches.
For shingle_filter to see the right neighbours, the split has to happen at the tokenizer level, whatever the filter chain is. That is why I was using the "simple_pattern_split" tokenizer, but then I have to enumerate the word-break characters myself.
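For comparison, with my simple_pattern_split analyzer from above (assuming those settings live on an index I'll call teste_split), the dots are already gone when the shingle filter runs:

```json
GET teste_split/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text": "NP 4.8.1"
}
```

With the default shingle settings (min and max size 2, unigrams kept), this should yield the unigrams "NP", "4", "8", "1" plus the shingles "NP 4", "4 8" and "8 1" — so "NP 4" is there.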

It's complicated, because even if you increase the shingle limit you will get more tokens in addition to the 4 you want, and I don't know if that's what you need.

I agree with you, but it's certainly far fewer than with the min_gram and max_gram parameters that the NGram tokenizer relies on. Cutting a phrase into words generates far fewer tokens than cutting it into characters, for sure.
The dilemma is: how many words can the user enter in their input? The value of max_shingle_size then follows directly from that.