Processor for shingles, similar to split?

Is there a processor that will do shingles or can I make a custom one somehow?

In the pipeline processor below, I split on whitespace, but I'd also like to combine words the way the shingle token filter would:

PUT _ingest/pipeline/split
{
  "processors": [
    {
      "split": {
        "field": "title",
        "target_field": "title_suggest.input",
        "separator": "\\s+"
      }
    }
  ]
}

Example:

"Senior Business Developer" needs a suggestion field with these terms:

  1. Senior Business Developer
  2. Business Developer
  3. Developer

Any ideas are appreciated, thanks!

Well, I just wrote a script to do it. It's very basic (it emits every run of consecutive words in the title, not only the trailing ones), but here it is:

PUT _ingest/pipeline/script
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Skip documents that have no title field
          if (!ctx.containsKey('title')) { return; }
          def title_words = ctx['title'].splitOnToken(' ');
          def title_suggest = [];
          // For each start word, emit every run of consecutive words beginning there
          for (def i = 0; i < title_words.length; i++) {
            def shingle = title_words[i];
            title_suggest.add(shingle);
            for (def j = i + 1; j < title_words.length; j++) {
              shingle = shingle + ' ' + title_words[j];
              title_suggest.add(shingle);
            }
          }
          ctx['title_suggest'] = title_suggest;
        """
      }
    }
  ]
}
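
To check what the script produces before indexing anything, the pipeline can be run through the ingest simulate API. This is only a quick sketch: it assumes the pipeline above was stored under the name "script", and the sample title is the one from the question.

```console
POST _ingest/pipeline/script/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Senior Business Developer"
      }
    }
  ]
}
```

For this title the returned document should carry six entries in title_suggest: "Senior", "Senior Business", "Senior Business Developer", "Business", "Business Developer" and "Developer".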

Usage:

PUT /item
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suggest_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "title_suggest": {
        "type": "completion",
        "analyzer": "suggest_analyzer"
      }
    }
  }
}

PUT /item/_doc/1?pipeline=script
{
  "title": "Diabetes Mellitus Type 1"
}

Result:

GET /item/_doc/1

{
  "_index" : "item",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 24,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title_suggest" : [
      "Diabetes",
      "Diabetes Mellitus",
      "Diabetes Mellitus Type",
      "Diabetes Mellitus Type 1",
      "Mellitus",
      "Mellitus Type",
      "Mellitus Type 1",
      "Type",
      "Type 1",
      "1"
    ],
    "title" : "Diabetes Mellitus Type 1"
  }
}
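
The reason for storing every shingle is that the completion suggester matches on the prefix of each stored input, so a user who starts typing in the middle of the title still gets a hit. A minimal sketch of such a query against the index above; the suggestion name "title-suggest" and the prefix "mell" are just illustrative choices:

```console
POST /item/_search
{
  "suggest": {
    "title-suggest": {
      "prefix": "mell",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}
```

With only the full title stored as input, a prefix like "mell" would not be expected to match; with the shingles in place, inputs such as "Mellitus Type 1" begin with it.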

Note: It's frustrating that I can't just run the text through the built-in shingle token filter inside the pipeline and write the resulting shingles to another field.
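
If only the trailing shingles from the original example are wanted ("Senior Business Developer", "Business Developer", "Developer"), the nested loop can be dropped. A rough sketch, assuming the same title and title_suggest fields; the pipeline name "suffix_shingles" is made up for illustration:

```console
PUT _ingest/pipeline/suffix_shingles
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Skip documents that have no title value
          if (ctx['title'] == null) { return; }
          def words = ctx['title'].splitOnToken(' ');
          def suggestions = [];
          def shingle = '';
          // Walk backwards, prepending one more word at each step
          for (int i = words.length - 1; i >= 0; i--) {
            shingle = (i == words.length - 1) ? words[i] : words[i] + ' ' + shingle;
            suggestions.add(shingle);
          }
          ctx['title_suggest'] = suggestions;
        """
      }
    }
  ]
}
```

For "Senior Business Developer" this should add "Developer", "Business Developer" and "Senior Business Developer", in that order. The order should not matter for a completion field.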
