Problem with highlighting (sentence cut off)

Hello,

I have a problem with highlighting. When I search for the keyword "éolienne", Elasticsearch highlights "l'" instead of "l'éolienne" (see picture problem_highlight.png).

The complete sentence is:
"""Daulitz, Domaine de Larroque Vieil, Ferme de Clèche, impasse As Prats, impasse de la Forge, impasse de l’Eolienne, impasse du Merle, impasse du Pinson, impasse Flourac, impasse Loubine, impasse Lucie Aubrac, impasse Pierre Dupont, impasse Redorthe, impasse Roundy"""
And this is what I get in the highlight:
"""Daulitz, Domaine de Larroque Vieil, Ferme de Clèche, impasse As Prats,
impasse de la Forge, impasse de l’"""

My query is:
{
  "highlight": {
    "boundary_scanner": "sentence",
    "boundary_scanner_locale": "fr-FR",
    "fields": {
      "*": {}
    }
  },
  "query": {
    "bool": {
      "boost": 1,
      "filter": ,
      "must": [
        {
          "multi_match": {
            "fields": "texte_extrait.raw",
            "query": "éolienne",
            "type": "phrase"
          }
        }
      ]
    }
  }
}
In the same index, the highlights are correct for other documents (see picture highlight_ok.png).
I think there is a problem with sentence splitting. Do you have a solution for this problem?
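
One way to test the sentence-splitting idea is to rerun the same query with an explicit fragment_size. The Elasticsearch highlighting documentation describes the sentence boundary scanner as breaking sentences longer than fragment_size (100 characters by default) at a word boundary, and says that a fragment_size of 0 disables that splitting. The Python sketch below only builds the request body. The filter clause is left out because its content is not shown in the post, "fields" is written as a list, and whether this setting fixes the cut-off fragment is an assumption to verify, not a confirmed diagnosis.

```python
import json


def build_search_body(keyword, fragment_size=0):
    """Build a search body like the one in the post, with an explicit fragment_size."""
    return {
        "highlight": {
            "boundary_scanner": "sentence",
            "boundary_scanner_locale": "fr-FR",
            # Assumption to verify: 0 asks the highlighter not to split long sentences.
            "fragment_size": fragment_size,
            "fields": {"*": {}},
        },
        "query": {
            "bool": {
                "boost": 1,
                # The "filter" clause from the post is omitted (its content is not shown).
                "must": [
                    {
                        "multi_match": {
                            "fields": ["texte_extrait.raw"],
                            "query": keyword,
                            "type": "phrase",
                        }
                    }
                ],
            }
        },
    }


body = build_search_body("éolienne")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Comparing the highlight returned with fragment_size 0 against the default would show whether the fragment length limit, rather than the French sentence detection itself, is what cuts the text.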

Thanks a lot.

Minwei DENG


Hi @minwei.deng,

Welcome! Can you share your index mapping and tell us which analyzer you are using? Is it the standard analyzer or a particular language analyzer?
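
For reference, here is a rough Python sketch of the requests that would gather this information. It only builds and prints the requests, it does not send them. The index name "my-index" is hypothetical, the field name is taken from the query in the post, and the sample text is a fragment of the sentence being highlighted.

```python
import json

INDEX = "my-index"  # hypothetical index name, replace with the real one


def diagnostic_requests(index, field, sample_text):
    """Return (method, path, body) tuples for inspecting mapping, settings and analysis."""
    return [
        ("GET", f"/{index}/_mapping", None),   # field types and analyzers per field
        ("GET", f"/{index}/_settings", None),  # custom analyzers, filters, char filters
        # The _analyze API shows the tokens and offsets produced for a given field.
        ("POST", f"/{index}/_analyze", {"field": field, "text": sample_text}),
    ]


for method, path, body in diagnostic_requests(
    INDEX, "texte_extrait.raw", "impasse de l’Eolienne"
):
    print(method, path)
    if body is not None:
        print(json.dumps(body, ensure_ascii=False))
```

The _analyze output is the most useful part here, since it shows how "l’Eolienne" is tokenized and which character offsets each token gets.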

Hi @carly.richmond, thanks a lot for looking at my case.

I use this analyzer, "analyse_texte_libre":

"analyse_texte_libre": {
            "filter": [
              "asciifolding",
              "lowercase",
              "fr_worddelimiter",
              "fr_elision",
              "fr_stop",
              "fr_snowball",
              "samepos_unique"
            ],
            "charfilter": [
              "fr_abrev_mapping",
              "urls_filter",
              "html_strip",
              "num_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }

My filters are:

"filter": {
          "fr_stop": {
            "ignore_case": "true",
            "remove_trailing": "true",
            "type": "stop",
            "stopwords": "_french_"
          },
          "samepos_unique": {
            "type": "unique",
            "only_on_same_position": "true"
          },
          "fr_worddelimiter": {
            "catenate_all": "true",
            "split_on_numerics": "false",
            "language": "french",
            "split_on_case_change": "false",
            "type": "word_delimiter"
          },
          "fr_snowball": {
            "type": "snowball",
            "language": "french"
          },
          "french_stemmer": {
            "type": "stemmer",
            "language": "light_french"
          },
          "fr_elision": {
            "type": "elision",
            "articles": [
              "l",
              "m",
              "t",
              "qu",
              "n",
              "s",
              "j",
              "d",
              "c",
              "jusqu",
              "quoiqu",
              "lorsqu",
              "puisqu",
              "parce qu",
              "parcequ",
              "entr",
              "presqu",
              "quelqu"
            ]
          }
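
To make the role of fr_elision concrete, here is a small Python approximation of what an elision filter does to a single token: it looks for the first apostrophe (straight or typographic) and drops the prefix if it is one of the configured articles. This is a simplified sketch, not Lucene's implementation, and it ignores the other filters in the chain (asciifolding, lowercase, fr_worddelimiter), which run before fr_elision and may already have changed the token.

```python
# Articles copied from the fr_elision definition in the post.
ARTICLES = {
    "l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu",
    "lorsqu", "puisqu", "parce qu", "parcequ", "entr", "presqu", "quelqu",
}
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophes


def strip_elision(token, articles=ARTICLES):
    """Remove a leading elided article such as l' from a token, if present."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            # Only the first apostrophe is considered.
            if token[:i] in articles:
                return token[i + 1:]
            return token
    return token


print(strip_elision("l\u2019eolienne"))  # eolienne
print(strip_elision("l'eolienne"))       # eolienne
print(strip_elision("aujourd'hui"))      # aujourd'hui (prefix is not an article)
```

In this approximation the comparison is case-sensitive, which matches the analyzer above only because lowercase runs earlier in the filter chain.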

My char filters are:

"charfilter": {
          "fr_abrev_mapping": {
            "type": "mapping",
            "mappings": [
              "k€ => milliers d'euros",
              "m€ => millions d'euros",
              "€ => euros",
              "m² => mètres carrés",
              "m2 => mètres carrés",
              "1er => premier",
              "© => copyright",
              ", => ",
              ". => ",
              "; => ",
              ": => ",
              "? => ",
              "! => "
            ]
          },
          "url_filter": {
            "pattern": "(http|ftp|www)\\S*",
            "type": "pattern_replace",
            "replacement": ""
          },
          "num_filter": {
            "pattern": "\\d+",
            "type": "pattern_replace",
            "replacement": " "
          },
          "rem_html": {
            "type": "html_strip"
          }
        }
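
As an illustration of what fr_abrev_mapping does to the text before tokenization, here is a rough Python imitation of a mapping char filter that prefers the longest matching key at each position (so "k€" wins over "€"). It is a sketch only: the punctuation entries are assumed to map to an empty string, which is how they appear in the post, and the real filter also keeps track of original offsets, which this code does not.

```python
import re

# Mappings copied from fr_abrev_mapping; punctuation is assumed to map to "".
MAPPINGS = {
    "k€": "milliers d'euros",
    "m€": "millions d'euros",
    "€": "euros",
    "m²": "mètres carrés",
    "m2": "mètres carrés",
    "1er": "premier",
    "©": "copyright",
    ",": "", ".": "", ";": "", ":": "", "?": "", "!": "",
}


def apply_mappings(text, mappings=MAPPINGS):
    """Replace every mapped key in text, trying longer keys first."""
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group(0)], text)


print(apply_mappings("3 k€, 20 m²."))  # 3 milliers d'euros 20 mètres carrés
print(apply_mappings("5 €"))           # 5 euros
```

Because this filter removes sentence punctuation from the analyzed text, comparing the analyzed tokens with the original sentence can help when checking where a highlight fragment starts and ends.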
