Exact match using the sentinel token technique (Elasticsearch 7.x)

Hello,
I was reading the Relevant Search book, in particular the section on the sentinel token technique, which is also described here.

Following the book's recipe, the easiest way I could achieve an exact match is by creating a new field in my document, like this:

{
   "name": "big blue rental car",
   "nameWithSentinels": "sentinel_begin big blue rental car sentinel_end"
}

and then using a match_phrase query against the nameWithSentinels field, with sentinel_begin <user_search> sentinel_end as the value.
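
For reference, with the example document above, such a query would look roughly like this (the phrase value is just the user's input wrapped in the sentinel tokens):

{
    "query": {
        "match_phrase": {
            "nameWithSentinels": "sentinel_begin big blue rental car sentinel_end"
        }
    }
}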

However, I do wonder if there's a way I can achieve this without having to create the nameWithSentinels field, by leveraging token filters instead.

I'm thinking of creating a filter that, given the tokens ["big", "blue", "rental", "car"], would be capable of returning ["sentinel_begin", "big", "blue", "rental", "car", "sentinel_end"] or ["sentinel_begin big", "blue", "rental", "car sentinel_end"] (I believe the same match_phrase query presented above would still work).

So far, my settings look like this:

{
    "idx_0001": {
        "settings": {
         // stuff removed for brevity
            "analysis": {
                "filter": {
                    "sentinel_border_condition_end": {
                        "filter": [
                            "sentinel_border_end"
                        ],
                        "type": "condition",
                        "script": {
                            "source": "token.getPosition() === [NO IDEA HOW I CAN GET THE NUMBER OF TOKENS]"
                        }
                    },
                    "sentinel_border_begin": {
                        "pattern": "^",
                        "type": "pattern_replace",
                        "replacement": "SENTINEL_BEGIN"
                    },
                    "sentinel_border_end": {
                        "pattern": "$",
                        "type": "pattern_replace",
                        "replacement": "SENTINEL_END"
                    },
                    "sentinel_border_condition_begin": {
                        "filter": [
                            "sentinel_border_begin"
                        ],
                        "type": "condition",
                        "script": {
                            "source": "token.getPosition() === 0"
                        }
                    }
                },
                "analyzer": {
                    "my-analyzer": {
                        "filter": [
                            "sentinel_border_condition_begin",
                            "sentinel_border_condition_end",
                            "lowercase",
                            "stop"
                        ],
                        "type": "custom",
                        "tokenizer": "standard"
                    }
                }
            }
        }
    }
}

But as you can see, I'm not sure what the Painless script for the sentinel_border_condition_end token filter should be.
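
To check what tokens actually come out of this analyzer, the _analyze API can be used against the index (a quick sketch, reusing the index and analyzer names from the settings above):

POST /idx_0001/_analyze
{
    "analyzer": "my-analyzer",
    "text": "big blue rental car"
}

This simply returns the emitted tokens and their positions, which makes it easy to see whether the sentinel filters fired where expected.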

A last resort would be creating a plugin, but from what I can see the documentation on token filter plugins is fairly sparse.

Thank you in advance,

A.

I just realized I can actually achieve what I need using an ingest pipeline.
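
A minimal sketch of such a pipeline, assuming a script processor that builds the sentinel-wrapped field at index time (the pipeline name add_sentinels is purely illustrative; the field names come from the example document above):

PUT _ingest/pipeline/add_sentinels
{
    "processors": [
        {
            "script": {
                "source": "ctx.nameWithSentinels = 'sentinel_begin ' + ctx.name + ' sentinel_end'"
            }
        }
    ]
}

Indexing documents with ?pipeline=add_sentinels then populates nameWithSentinels automatically, and the match_phrase query shown earlier keeps working unchanged.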
