Remove duplicate tokens after edge_ngram on array

I have a field where multiple values are stored:

field: ["testOneTwo", "testThreeFour"]

I would like to analyze this field with an edge_ngram filter, but also remove duplicate tokens. I have tried both the unique and the remove_duplicates filters.
Example settings:

    {
        "settings": {
            "analysis": {
                "filter": {
                    "edgengram_filter": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 24
                    }
                },
                "tokenizer": {
                    "edgengram": {
                        "type": "edge_ngram",
                        "min_gram": 1,
                        "max_gram": 24,
                        "token_chars": [
                            "letter",
                            "digit"
                        ]
                    }
                },
                "analyzer": {
                    "testunique": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "edgengram_filter",
                            "unique"
                        ]
                    },
                    "testremove": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "edgengram_filter",
                            "remove_duplicates"
                        ]
                    },
                    "testedgeunique": {
                        "tokenizer": "edgengram",
                        "filter": [
                            "unique"
                        ]
                    },
                    "testedgeremove": {
                        "tokenizer": "edgengram",
                        "filter": [
                            "remove_duplicates"
                        ]
                    }
                }
            }
        }
    }

For each of these analyzers, the _analyze API returns the tokens t, te, tes, test twice.
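For example, a request like the following (my-index is just a placeholder for an index created with the settings above):

    GET my-index/_analyze
    {
        "analyzer": "testunique",
        "text": ["testOneTwo", "testThreeFour"]
    }

produces t, te, tes, test once per array element, presumably because each element is run through the analyzer (and therefore the unique/remove_duplicates filter) separately.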

When both values are stored as a single string, e.g. "testOneTwo testThreeFour", it works as expected. But that is not a solution for me, because I use copy_to together with edge_ngram as a token filter.
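For context, the copy_to setup looks roughly like this (field names are simplified examples, not my real mapping; assume the analysis settings above are part of the same index):

    PUT my-index
    {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "copy_to": "search_field"
                },
                "tags": {
                    "type": "text",
                    "copy_to": "search_field"
                },
                "search_field": {
                    "type": "text",
                    "analyzer": "testunique"
                }
            }
        }
    }

Since copy_to collects the copied values as separate values of the target field, search_field is effectively an array again and the duplicate n-grams come back.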

Is there any way to enforce deduplication here? Thanks!

How about using the join processor in an ingest pipeline? You can concatenate the strings so the field is indexed as "testOneTwo testThreeFour".
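Something along these lines (pipeline and field names are just examples):

    PUT _ingest/pipeline/join-field
    {
        "processors": [
            {
                "join": {
                    "field": "field",
                    "separator": " "
                }
            }
        ]
    }

Then index the documents with the pipeline, e.g. PUT my-index/_doc/1?pipeline=join-field, and the array is flattened into a single string before it is analyzed.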

Sadly, this won't work when I use copy_to, so I guess I have to join the values before indexing...

