Can we do: Analyzer -> Tokenizer -> Token Filter -> Re-tokenize, and consider only the final tokens?

Hello,

Given the input text:

{
  "analyzer": "parapheur_shingle",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}

I would like the following steps:

  1. tokenize as standard (word break on spaces => 8 tokens)
  2. filter each token ("lowercase", "pattern_replace"); "pattern_replace" replaces "." with " "
  3. so we obtain the token "4 8 4 1" after applying pattern_replace to the original token "4.8.4.1"
  4. redo step 1 => 8 + 5 more tokens
  5. filter each token with the shingle filter (this filter depends on the surrounding tokens, so its output here, after re-tokenizing, differs from what it would produce right after step 3)
  6. end of token generation
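To make step 4 concrete: the two dotted tokens "4.8.4.1" and "4.8.4" are re-split into 4 and 3 tokens respectively, so the 8 original tokens become 13. After lowercasing (step 2), the stream going into the shingle filter would be:

```json
["w", "4", "8", "4", "1", "ni", "fna", "np", "4", "8", "4", "test", "2"]
```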

Thank you.

I managed to do what I want with the "simple_pattern_split" tokenizer:

{
  "index": {
    "max_ngram_diff": 50,
    "analysis": {
      "analyzer": {
        "parapheur_shingle_new": {
          "tokenizer": "pattern_split_new",
          "filter": [ "shingle" ]
        }
      },
	  "tokenizer": {
	    "pattern_split_new": {
	      "type": "simple_pattern_split",
	      "pattern": "[\\s+ \\.]"
	    }
	  }
    }
  }
}
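To check the output, the analyzer can be exercised with _analyze (assuming the settings above were created on an index named, say, teste_split; the index name is just for illustration):

```json
GET teste_split/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}
```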

But I'd like the split to be the same as what the "standard" tokenizer does, instead of my hard-coded pattern.
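Maybe a "pattern_replace" character filter could help here: character filters run before the tokenizer, so replacing "." with " " at that stage would let the "standard" tokenizer do the word breaking. A sketch, not tested (the names "dot_to_space" and "parapheur_shingle_std" are mine, not from any existing setup):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dot_to_space": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      },
      "analyzer": {
        "parapheur_shingle_std": {
          "char_filter": [ "dot_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        }
      }
    }
  }
}
```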

Hi @fraf

Do you want the tokens that way?

Yes, and even more tokens. I want max_shingle_size to be infinite; that is to say, the longest token is the whole input text (with dots replaced by spaces).
It should be possible, given this error message from the server:

In Shingle TokenFilter the difference between max_shingle_size and min_shingle_size (and +1 if outputting unigrams) must be less than or equal to: [3] but was [19]. This limit can be set by changing the [index.max_shingle_diff] index level setting.
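Following that message, the limit can be raised in the index settings, next to where I already set max_ngram_diff (the value 20 here is just an example):

```json
{
  "index": {
    "max_shingle_diff": 20
  }
}
```

and then max_shingle_size in the shingle filter could be set accordingly.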

Thx.

This is the analyzer that I used:

PUT teste
{
  "settings": {
    "analysis": {
      "analyzer": {
        "parapheur_shingle_new": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_filter",
            "shingle_filter"
          ]
        }
      },
      "filter": {
        "my_filter":{
          "type": "pattern_replace",
          "pattern": """\.""",
          "replacement": " "
        },
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4
        }
      }
    }
  }
}

GET teste/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text":     "W 4.8.4.1 NI FNA NP 4.8.4 TEST 2"
}

Thanks, but be careful: that's not exactly the same.

Given the input text "NP 4.8.1", you won't be able to match the search input "NP 4".
With the standard tokenizer, you obtain 2 tokens: "NP" and "4.8.1".
On these two tokens, you apply your filters:
"NP" => unchanged => "NP"
"4.8.1" => my_filter => "4 8 1"
shingle_filter => it shingles two adjacent tokens, so it generates the new token "NP 4 8 1".

So you have three tokens in all: "NP", "4 8 1" and "NP 4 8 1"... the "NP 4" token is missing! So no matches.
For shingle_filter to see the right neighbours, the split has to happen at the tokenizer level, whatever the filter chain is. That is why I was using the "simple_pattern_split" tokenizer, but then I have to enumerate the word-break characters myself.
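For comparison, with my simple_pattern_split analyzer from above (assuming those settings live on an index I'll call teste_split), the dots are already gone when the shingle filter runs:

```json
GET teste_split/_analyze
{
  "analyzer": "parapheur_shingle_new",
  "text": "NP 4.8.1"
}
```

With the default shingle settings (min and max size 2, unigrams kept), this should yield the unigrams "NP", "4", "8", "1" plus the shingles "NP 4", "4 8" and "8 1" — so "NP 4" is there.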

It's complicated, because even if you increase the shingle limit you will get more tokens in addition to the 4 you want, and I don't know if that's what you need.

I agree with you, but it's certainly far fewer than with the min_gram and max_gram parameters that the NGram tokenizer relies on. Cutting a phrase into words generates far fewer tokens than cutting it into characters, for sure.
The dilemma is: how many words can the user enter in their input? The value of max_shingle_size then follows directly from that.