Remove duplicate tokens after edge_ngram on array

I have a field where multiple values are stored:

field: ["testOneTwo", "testThreeFour"]

I would like to analyze this field with an edge_ngram filter, but also remove duplicate tokens. I have tried both the unique and the remove_duplicates filters.
Example settings:

    {
        "settings": {
            "analysis": {
                "filter": {
                    "edgengram_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 24
                    }
                },
                "tokenizer": {
                    "edgengram": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 24,
                        "token_chars": [
                            "letter",
                            "digit"
                        ]
                    }
                },
                "analyzer": {
                    "testunique": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "edgengram_filter",
                            "unique"
                        ]
                    },
                    "testremove": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "edgengram_filter",
                            "remove_duplicates"
                        ]
                    },
                    "testedgeunique": {
                        "tokenizer": "edgengram",
                        "filter": [
                            "unique"
                        ]
                    },
                    "testedgeremove": {
                        "tokenizer": "edgengram",
                        "filter": [
                            "remove_duplicates"
                        ]
                    }
                }
            }
        }
    }

For each of these analyzers, the _analyze API returns the tokens t, te, tes, test twice.
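For example, a request like the following (my-index is just a placeholder for an index created with the settings above):

    GET my-index/_analyze
    {
        "analyzer": "testunique",
        "text": ["testOneTwo", "testThreeFour"]
    }

produces t, te, tes, test once per array element, presumably because each element is run through the analyzer (and therefore the unique/remove_duplicates filter) separately.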

When both values are stored as a single string, e.g. "testOneTwo testThreeFour", it works as expected. But that is not a solution for me, because I use copy_to together with edge_ngram as a token filter.
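For context, the copy_to setup looks roughly like this (field names are simplified examples, not my real mapping; assume the analysis settings above are part of the same index):

    PUT my-index
    {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "copy_to": "search_field"
                },
                "tags": {
                    "type": "text",
                    "copy_to": "search_field"
                },
                "search_field": {
                    "type": "text",
                    "analyzer": "testunique"
                }
            }
        }
    }

Since copy_to collects the copied values as separate values of the target field, search_field is effectively an array again and the duplicate n-grams come back.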

Is there any way to enforce deduplication here? Thanks!

How about using the join processor in an ingest pipeline? You can concatenate the strings so the field is indexed as "testOneTwo testThreeFour".
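Something along these lines (pipeline and field names are just examples):

    PUT _ingest/pipeline/join-field
    {
        "processors": [
            {
                "join": {
                    "field": "field",
                    "separator": " "
                }
            }
        ]
    }

Then index the documents with the pipeline, e.g. PUT my-index/_doc/1?pipeline=join-field, and the array is flattened into a single string before it is analyzed.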

Sadly, this won't work when I use copy_to, so I guess I have to join the values before indexing...

