AsciiFolding: Can't make it work with ingest-attachment

Hi ES community,
I have a problem getting asciifolding to work with an ingest-attachment pipeline.
Here is the whole setup (to avoid any hidden mistake :wink:).
You can see that I use the lowercase filter and my own stop words filter. Those filters work fine, but asciifolding does not. The last command searches for a word without diacritics that should match.

# delete the old pipeline and index
curl -X DELETE "localhost:9200/_ingest/pipeline/doctestpl"
curl -X DELETE "localhost:9200/doctest"


# create my customized index
curl -X PUT "localhost:9200/doctest" -H 'Content-Type: application/json' -d'
{
	"settings": {
		"index": {
			"number_of_shards": 20,
			"number_of_replicas": 1
		},
		"analysis": {
			"filter": {
				"general_stop_words": {
					"type": "stop",
					"stopwords": ["a", "an", "and", "as", "at", "be", "but", "by", "for", "had", "has", "have", "he", "her", "him", "his", "how", "i", "if",
						"in", "is", "it", "me", "my", "no", "of", "on", "or", "so", "some", "such", "than", "that", "the", "then", "these", "this",
						"those", "to", "we", "who", "''s",
						"alors", "au", "aussi", "avec", "car", "ce", "c''", "cela", "de", "dont", "ces", "ci", "comme", "dans", "des", "du", "donc",
						"elle", "elles", "en", "est", "et", "eu", "il", "ils", "je", "la", "le", "les", "leur", "ma", "mais", "mes", "mon", "meme", "ni", "nous", "ou", "on", "or",
						"par", "pas", "pour", "puis", "que", "qui", "sa", "ses", "si", "son", "sur", "ta", "tes", "ton", "tous", "tout", "tres", "tu",
						"votre", "vous", "vu", "ca", "ete", "etre", "y"]
				}
			},
			"tokenizer": {
				"test_tokenizer": {
					"type": "pattern",
					"pattern": "(?i)([a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}|c\\+\\+|c#|j\\+\\+|f#|x\\+\\+|c--|j#|d\\+\\+|go!|c/al|[a-z0-9]+)",
					"group": 1
				}
			},
			"analyzer": {
				"test_analyzer": {
					"filter": ["asciifolding", "lowercase", "general_stop_words"],
					"tokenizer": "test_tokenizer"
				}
			}
		}
	},
	"mappings": {
		"document": {
			"properties": {
				"attachment": {
					"properties": {
						"content": {
							"type": "text",
							"analyzer": "test_analyzer",
							"fields": {
								"keyword": {
									"type": "keyword",
									"ignore_above": 256
								}
							}
						}
					}
				},
				"filename": {
					"type": "text"
				},
				"docid": {
					"type": "text"
				},
				"insertdate": {
					"type": "date"
				},
				"islastversion": {
					"type": "boolean"
				},
				"docversion": {
					"type": "long"
				},
				"downloadcount": {
					"type": "long"
				}
			}
		}
	}
}'


# create pipeline for test docs
curl -X PUT "localhost:9200/_ingest/pipeline/doctestpl" -H 'Content-Type: application/json' -d'
{
	"description": "Extract attachment information dedicated to test",
	"processors": [{
			"attachment": {
				"field": "payload",
				"indexed_chars": "-1"
			}
		}
	]
}'

# inject the sentence.
curl -X PUT "localhost:9200/doctest/document/doc2?pipeline=doctestpl" -H 'Content-Type: application/json' -d'
{
	"payload": "TGUgcMOpbGVyaW4gcMOqY2hlIHNvbiBwb2lzc29uIGF1eCBBw6dvcmVzLg==",
	"filename": "kiki.txt",
	"docid": "010234",
	"docversion": "1",
	"insertdate": "2019-01-25T17:26:00Z",
	"downloadcount": "0",
	"islastversion": "true"
}'
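
For reference, the payload is just the base64-encoded test sentence, which contains the accented word pêche:

# decode the payload to see the sentence being indexed
echo "TGUgcMOpbGVyaW4gcMOqY2hlIHNvbiBwb2lzc29uIGF1eCBBw6dvcmVzLg==" | base64 --decode
# Le pélerin pêche son poisson aux Açores.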

# search for the word peche: nothing is returned
curl -X GET "localhost:9200/doctest/_search" -H 'Content-Type: application/json' -d'
{
	"query": {
		"constant_score": {
			"filter":{
				"bool": {
					"filter":{
						"term": {"islastversion": true}
					}
					,
					"must": [
						{
							"match":{
								"attachment.content":"peche"
							}
						}
					]	
				}
			}
		}
	}
}'

I don't know what's going on with that pattern tokenizer, but that's the cause of your problems. It gives me a headache just looking at that regular expression :wink:

You can see it's the tokenizer causing problems by using the _analyze API:

GET doctest/_analyze
{
  "analyzer": "test_analyzer",
  "text": "pêche"
}
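
The response shows two tokens, something like this:

{
  "tokens": [
    { "token": "p",   "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "che", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 }
  ]
}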

Somehow that tokenizer breaks up the word pêche into p and che. But it doesn't do the same with the word peche from your match query (without diacritics):

GET doctest/_analyze
{
  "analyzer": "test_analyzer",
  "text": "peche"
}
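
Here the response contains just the single token you would expect:

{
  "tokens": [
    { "token": "peche", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }
  ]
}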

That's why the query is not a match for the document.

I'm not sure what you're trying to do with that tokenizer. Do you really need it? Maybe you could switch to the standard tokenizer instead?
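
For example, here's a quick sketch (not a drop-in replacement for your full analyzer) showing that the standard tokenizer keeps pêche in one piece, so asciifolding can then do its job:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "pêche"
}

This returns the single token peche, which your match query for peche would find.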

Wow! Thank you for spotting that :pray: !!

The tokenizer is there to keep some technical words (mainly programming language names) from being destroyed by the standard tokenizer :sweat_smile:.

Example: with the standard tokenizer, "c#" is reduced to the token "c", and "c++" ends up as ... "c" too. So if I search for someone doing c# ... I get everyone doing c* !!
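
You can see it with a quick _analyze call against the standard tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "text": "c# c++"
}

Both terms come back as the bare token c, since the standard tokenizer drops the # and + characters.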

So the regex is not that complicated: it only keeps alphanumeric words (without diacritics, which is the problem here) and a few specific programming language patterns. But I thought the tokenizer was applied after asciifolding ... :roll_eyes:

Note that before trying asciifolding, I had a char filter that worked fine, but I wanted to replace it with asciifolding.
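
For reference, such a char filter would look roughly like this (a minimal sketch; the fold_diacritics name and the mappings list are illustrative, not my exact configuration). Char filters run before the tokenizer, while token filters like asciifolding run after it, so with this in place the pattern tokenizer sees peche instead of pêche:

"analysis": {
	"char_filter": {
		"fold_diacritics": {
			"type": "mapping",
			"mappings": ["é => e", "è => e", "ê => e", "à => a", "ç => c"]
		}
	},
	"analyzer": {
		"test_analyzer": {
			"char_filter": ["fold_diacritics"],
			"tokenizer": "test_tokenizer",
			"filter": ["lowercase", "general_stop_words"]
		}
	}
}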

So, I will rework my setup.
Thanks again!