How can I apply stemming after using a custom tokenizer?


(Dimon Ho) #1

Suppose I have an index like this:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[;]+"
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}

Testing the analyzer:

GET /myindex/_analyze
{
  "analyzer": "my_analyzer",
  "text": "running dates;Sex health education;Perceptions towards Sexual Health Education"
}

I need to split "running dates;Sex health education;Perceptions towards Sexual Health Education" on semicolons and then reduce every word inside each resulting token to its stem. The expected result is:

{
  "tokens": [
    {
      "token": "run date",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "sex health educ",
      "start_offset": 14,
      "end_offset": 34,
      "type": "word",
      "position": 1
    },
    {
      "token": "percept toward sexual health educ",
      "start_offset": 35,
      "end_offset": 78,
      "type": "word",
      "position": 2
    }
  ]
}

However, the actual result is:

{
  "tokens": [
    {
      "token": "running d",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "sex health educ",
      "start_offset": 14,
      "end_offset": 34,
      "type": "word",
      "position": 1
    },
    {
      "token": "perceptions towards sexual health educ",
      "start_offset": 35,
      "end_offset": 78,
      "type": "word",
      "position": 2
    }
  ]
}
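
For illustration, here is a rough Python sketch of what the tokenizer and lowercase stages of my_analyzer hand to the stemmer. It is only an approximation written for this post, not Elasticsearch's implementation; the helper name pattern_tokenize is hypothetical, and the stemmer step itself is deliberately left out.

```python
import re


def pattern_tokenize(text, pattern=r"[;]+"):
    """Split text on the separator pattern and keep character offsets.

    Hypothetical helper that mimics a separator-style pattern tokenizer;
    it is not the Elasticsearch implementation.
    """
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        # Emit the text between the previous separator and this one.
        if match.start() > start:
            tokens.append((text[start:match.start()], start, match.start()))
        start = match.end()
    # Emit whatever follows the last separator.
    if start < len(text):
        tokens.append((text[start:], start, len(text)))
    return tokens


text = "running dates;Sex health education;Perceptions towards Sexual Health Education"

# Lowercase each token, as the "lowercase" filter would.
lowercased = [(tok.lower(), s, e) for tok, s, e in pattern_tokenize(text)]

for tok, s, e in lowercased:
    print(repr(tok), s, e)
```

Each token that reaches the stemmer is still a whole multi-word phrase. A token filter works on one token at a time, so the stemmer appears to treat the phrase as a single word and only alters its ending, which matches the "running d" and "... educ" tokens in the actual output.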

How can I achieve this?

