Highlight still cuts fragments at non-boundary characters even with boundary_chars set

wpzdm · August 11, 2016, 12:03pm

I have set the highlight boundary_chars parameter, but fragments are sometimes still cut at a non-boundary character.

Test data:

PUT wpz_2

PUT wpz_2/_mapping/test
{
  "properties": {
    "test": {
      "analyzer": "index_ansj", 
      "type": "string", 
      "term_vector": "with_positions_offsets"
    }
  }
}

PUT wpz_2/test/3
{
  "test": "，全市工业80%以上的大型装备实现了信息化集成。投资2000万元启动“智慧企业”专项行动，重点支持工业企业无线、物联技术应用。"
}
GET  wpz_2/test/_search
{
  "query": {
    "match": {
      "test": "智慧"
    }
  },
  "highlight": {
    "boundary_chars": ".,!?；，。？！",
    "fragment_size": 30,
    "fields": {
      "test": {}
    }
  }
}

Output:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.057534903,
    "hits": [
      {
        "_index": "wpz_2",
        "_type": "test",
        "_id": "3",
        "_score": 0.057534903,
        "_source": {
          "test": "，全市工业80%以上的大型装备实现了信息化集成。投资2000万元启动“智慧企业”专项行动，重点支持工业企业无线、物联技术应用。"
        },
        "highlight": {
          "test": [
            "集成。投资2000万元启动“<em>智慧</em>企业”专项行动，重点支持工业企业无线、物联技术应用"
          ]
        }
      }
    ]
  }
}

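Note that the start of the highlighted fragment is not cut at the boundary char '。'.

The result is the same with the standard analyzer, so the analyzer should not be the cause.

My guess is that the problem lies in how boundary_chars and fragment_size work together. Is there any way to guarantee that a fragment always starts right after a boundary char?

ES version: 2.3.3

Thanks!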

medcl.net · September 1, 2016, 3:22am

fragment_size takes priority, so that the fragment has enough context before and after the match. Try a smaller fragment_size, for example 25.

wpzdm · September 12, 2016, 9:42am

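Thanks for the reply.

I tried it, but guaranteeing the fragment size does not seem to explain the behavior.
For example, the string
投资2000万元启动“智慧企业”专项行动，重点支持工业企业无线、物联技术应用
is far longer than 25 characters, yet fragment_size has to be set to 25 for the highlight not to include the leading '。'.
Increasing fragment_size also does not necessarily make the resulting highlight longer.
For example, with fragment_size set to 26 or 27, the generated highlight is in both cases
。投资2000万元启动“智慧企业”专项行动，重点支持工业企业无线、物联技术应用

As with the original case, this does not depend on the analyzer.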
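
The observations in this thread can be reproduced with a small model of how the fragment bounds might be chosen. This is only a sketch under assumptions, not the actual Elasticsearch code: it assumes the fast vector highlighter is in use (the field has term_vector set to with_positions_offsets), that the fragment is first centered on the match using fragment_size, and that a boundary scanner then looks backward from that start and forward from that end for at most boundary_max_scan characters (documented default 20). The function names find_start, find_end and fragment are made up for illustration, and the match is assumed to cover exactly the two characters 智慧.

```python
# Simplified model of fragment selection. ASSUMPTION: this mirrors the
# centering-then-boundary-scan behavior described above; it is not taken
# from the thread and is not the real highlighter implementation.

TEXT = "，全市工业80%以上的大型装备实现了信息化集成。投资2000万元启动“智慧企业”专项行动，重点支持工业企业无线、物联技术应用。"
BOUNDARY_CHARS = ".,!?；，。？！"  # same value as in the request above
MAX_SCAN = 20                      # assumed default of boundary_max_scan


def find_start(text, start, boundary_chars, max_scan):
    """Scan backward from start; return the position just after a boundary char."""
    if start < 1 or start > len(text):
        return start
    offset = start
    for _ in range(max_scan):
        if offset == 0:
            break
        if text[offset - 1] in boundary_chars:
            return offset
        offset -= 1
    # Reaching the beginning of the text counts as a boundary; otherwise give up
    # and keep the original, unaligned start.
    return 0 if offset == 0 else start


def find_end(text, end, boundary_chars, max_scan):
    """Scan forward from end; return the position of the next boundary char."""
    if end < 0 or end > len(text):
        return end
    for offset in range(end, min(len(text), end + max_scan)):
        if text[offset] in boundary_chars:
            return offset
    return end


def fragment(text, match_start, match_end, fragment_size,
             boundary_chars=BOUNDARY_CHARS, max_scan=MAX_SCAN):
    """Center a window of fragment_size on the match, then align to boundaries."""
    match_len = match_end - match_start
    margin = max(0, (fragment_size - match_len) // 2)
    start = max(0, match_start - margin)
    end = min(len(text), start + max(match_len, fragment_size))
    return text[find_start(text, start, boundary_chars, max_scan):
                find_end(text, end, boundary_chars, max_scan)]


if __name__ == "__main__":
    m = TEXT.index("智慧")
    for size in (25, 26, 27, 30):
        print(size, fragment(TEXT, m, m + 2, size))
```

In this model the scanner only looks backward from the computed start, so with fragment_size 30 the '。' that sits two characters after the start is never considered, and the 20-character backward scan stops one character short of the leading '，'. With 25 the computed start happens to fall right after '。', and with 26 or 27 it falls on '。' itself, which matches the outputs reported above. If the model holds, raising boundary_max_scan in the highlight request should make the fragment start right after a boundary char, at the cost of a longer fragment; that would need to be confirmed against a real 2.3.3 cluster.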
