AsciiFolding: Can't make it work with ingest-attachment

Hi ES community,
I have a problem getting asciifolding to work with an ingest-attachment pipeline.
Here is the whole code (to avoid any hidden mistake :wink: ).
You can see that I use the lowercase filter and my own stop words filter. Those filters are working fine, but asciifolding is not. The last command searches for a word without the diacritic, which should match.

#delete old pipeline and index
curl -X DELETE "localhost:9200/_ingest/pipeline/doctestpl"
curl -X DELETE "localhost:9200/doctest"


# create my customized index
curl -X PUT "localhost:9200/doctest" -H 'Content-Type: application/json' -d'
{
	"settings": {
		"index": {
			"number_of_shards": 20,
			"number_of_replicas": 1
		},
		"analysis": {
			"filter": {
				"general_stop_words": {
					"type": "stop",
					"stopwords": ["a", "an", "and", "as", "at", "be", "but", "by", "for", "had", "has", "have", "he", "her", "him", "his", "how", "i", "if",
						"in", "is", "it", "me", "my", "no", "of", "on", "or", "so", "some", "such", "than", "that", "the", "then", "these", "this",
						"those", "to", "we", "who", "''s",
						"alors", "au", "aussi", "avec", "car", "ce", "c''", "cela", "de", "dont", "ces", "ci", "comme", "dans", "des", "du", "donc",
						"elle", "elles", "en", "est", "et", "eu", "il", "ils", "je", "la", "le", "les", "leur", "ma", "mais", "mes", "mon", "meme", "ni", "nous", "ou", "on", "or",
						"par", "pas", "pour", "puis", "que", "qui", "sa", "ses", "si", "son", "sur", "ta", "tes", "ton", "tous", "tout", "tres", "tu",
						"votre", "vous", "vu", "ca", "ete", "etre", "y"]
				}
			},
			"tokenizer": {
				"test_tokenizer": {
					"type": "pattern",
					"pattern": "(?i)([a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}|c\\+\\+|c#|j\\+\\+|f#|x\\+\\+|c--|j#|d\\+\\+|go!|c/al|[a-z0-9]+)",
					"group": 1
				}
			},
			"analyzer": {
				"test_analyzer": {
					"filter": ["asciifolding", "lowercase", "general_stop_words"],
					"tokenizer": "test_tokenizer"
				}
			}
		}
	},
	"mappings": {
		"document": {
			"properties": {
				"attachment": {
					"properties": {
						"content": {
							"type": "text",
							"analyzer": "test_analyzer",
							"fields": {
								"keyword": {
									"type": "keyword",
									"ignore_above": 256
								}
							}
						}
					}
				},
				"filename": {
					"type": "text"
				},
				"docid": {
					"type": "text"
				},
				"insertdate": {
					"type": "date"
				},
				"islastversion": {
					"type": "boolean"
				},
				"docversion": {
					"type": "long"
				},
				"downloadcount": {
					"type": "long"
				}
			}
		}
	}
}'


# create pipeline for test docs
curl -X PUT "localhost:9200/_ingest/pipeline/doctestpl" -H 'Content-Type: application/json' -d'
{
	"description": "Extract attachment information dedicated to test",
	"processors": [{
			"attachment": {
				"field": "payload",
				"indexed_chars": "-1"
			}
		}
	]
}'

# inject the test sentence (the base64 payload decodes to "Le pélerin pêche son poisson aux Açores.")
curl -X PUT "localhost:9200/doctest/document/doc2?pipeline=doctestpl" -H 'Content-Type: application/json' -d'
{
	"payload": "TGUgcMOpbGVyaW4gcMOqY2hlIHNvbiBwb2lzc29uIGF1eCBBw6dvcmVzLg==",
	"filename": "kiki.txt",
	"docid": "010234",
	"docversion": "1",
	"insertdate": "2019-01-25T17:26:00Z",
	"downloadcount": "0",
	"islastversion": "true"
}'

# search for the word peche: nothing is returned
curl -X GET "localhost:9200/doctest/_search" -H 'Content-Type: application/json' -d'
{
	"query": {
		"constant_score": {
			"filter":{
				"bool": {
					"filter":{
						"term": {"islastversion": true}
					}
					,
					"must": [
						{
							"match":{
								"attachment.content":"peche"
							}
						}
					]	
				}
			}
		}
	}
}'

I don't know what's going on with that pattern tokenizer, but that's the cause of your problems. It gives me a headache just looking at that regular expression :wink:

You can see it's the tokenizer causing problems by using the _analyze API:

GET doctest/_analyze
{
  "analyzer": "test_analyzer",
  "text": "pêche"
}

Somehow that tokenizer breaks up the word pêche into p and che. But it doesn't do the same with the word peche from your match query (without diacritics):

GET doctest/_analyze
{
  "analyzer": "test_analyzer",
  "text": "peche"
}
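
Roughly, the first call should return the two tokens p and che, something like this (while the second call should return the single token peche):

{
  "tokens": [
    { "token": "p",   "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "che", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 }
  ]
}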

That's why the query is not a match for the document.

I'm not sure what you're trying to do with that tokenizer? Do you really need it? Maybe you can switch to the standard tokenizer instead?
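
For example, a quick ad-hoc _analyze like this (I haven't tried it against your index, but the standard tokenizer with lowercase and asciifolding should turn pêche into the single token peche):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "pêche"
}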


Wow! Thank you for spotting this :pray: !!

The tokenizer is there to keep some technical terms (mainly programming language names) from being destroyed by the standard tokenizer :sweat_smile:.

Example: with the standard tokenizer, "c#" is reduced to "c", and "c++" is reduced to... "c" as well. So if I'm looking for someone who knows c#... I get every c* !!
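
If I'm not mistaken, you can see it quickly with _analyze (just an illustration; the standard tokenizer drops the # and + characters):

GET _analyze
{
  "tokenizer": "standard",
  "text": "c# c++"
}

Both terms come back as just c.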

So the regex is not that complicated: it only keeps plain words and numbers (without diacritics, which is the problem here) plus a few specific programming language patterns. But I thought the tokenizer was applied after asciifolding... :roll_eyes:
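
One idea I will test (just a sketch, not verified yet, and it only covers Latin-1 accented letters) is to make the character classes in the pattern accept accented letters as well, so the whole word reaches asciifolding as a single token:

			"tokenizer": {
				"test_tokenizer": {
					"type": "pattern",
					"pattern": "(?i)([a-z0-9À-ÖØ-öø-ÿ._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}|c\\+\\+|c#|j\\+\\+|f#|x\\+\\+|c--|j#|d\\+\\+|go!|c/al|[a-z0-9À-ÖØ-öø-ÿ]+)",
					"group": 1
				}
			}

With that, pêche should stay as one token, and asciifolding (which runs after the tokenizer) should fold it to peche.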

Note that before using asciifolding, I had a char filter that worked fine, but I wanted to replace it with asciifolding.
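
It was roughly something like this (the name my_old_diacritics_filter is just a placeholder, and I only show a few of the mappings); since char filters run before the tokenizer, the accents were already gone when the pattern was applied:

			"char_filter": {
				"my_old_diacritics_filter": {
					"type": "mapping",
					"mappings": ["é => e", "è => e", "ê => e", "à => a", "ù => u", "ç => c"]
				}
			}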

So, I will rework my analyzer.
Thanks again!
