Elasticsearch Ingest Pipeline + index for language identification and text analysis

Hi,

I am using an Elasticsearch ingest pipeline for language identification. In addition, I would like to apply a language analyzer to each language-specific field. To my knowledge there is no ingest processor for language analyzers, so I created an index that uses my pipeline as its default pipeline and applies the language analyzers from the mapping.

Here is my index:

PUT my_index
{
  "settings": {
    "index.default_pipeline" : "my_sample_pipeline",
    "analysis" : {
      "analyzer": {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["my_apostrophe"]
        }
      },
      "filter" : {
        "my_apostrophe" : {
          "type" : "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "dynamic": true,
    "properties": {
      "description": {
        "analyzer" : "my_analyzer",
        "type" : "text",
        "fields" : {
          "en_analyzer": {
            "type": "text",
            "analyzer": "english"
          },
          "de_analyzer": {
            "type": "text",
            "analyzer": "simple"
          },
          "pt_analyzer": {
            "type": "text",
            "analyzer": "portuguese"
          },
          "fr_analyzer": {
            "type": "text",
            "analyzer": "french"
          },
          "zh_analyzer": {
            "type": "text",
            "analyzer": "smartcn"
          }
        }
      }
    }
  }
}
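
As a quick sanity check of the mapping, the _analyze API can show which tokens a given sub-field produces. This is only a sketch: the sample text is made up, and it assumes my_index was created successfully with the settings above.

GET my_index/_analyze
{
  "field": "description.en_analyzer",
  "text": "The quick brown foxes were jumping"
}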

And this is my pipeline:

PUT _ingest/pipeline/my_sample_pipeline
{
  "processors" : [
    {
      "inference" : {
        "model_id" : "lang_ident_model_1",
        "inference_config" : {
          "classification" : {
            "num_top_classes" : 1
          }
        },
        "field_map" : {
          "description" : "text"
        },
        "target_field" : "_ml.lang_ident"
      }
    },
    {
      "rename" : {
        "field" : "description",
        "target_field" : "description.raw"
      }
    },
    {
      "rename" : {
        "field" : "_ml.lang_ident.predicted_value",
        "target_field" : "description.language_processed"
      }
    },
    {
      "script" : {
        "lang" : "painless",
        "source" : "ctx.description.supported = (['de', 'en', 'fr', 'pt', 'zh'].contains(ctx.description.language_processed))"
      }
    },
    {
      "set" : {
        "if" : "ctx.description.supported",
        "field" : "description.{{description.language_processed}}",
        "value" : "{{description.raw}}",
        "override" : false
      }
    },
    {
      "set" : {
        "if" : "ctx.description.language_processed == 'en'",
        "field" : "description.{{description.language_processed}}",
        "value" : "{{description.en_analyzer}}"
      }
    }
  ]
}

After the SCRIPT processor and the first SET processor have stored the description in its language-specific field (for example description.en), I use the last SET processor to try to apply "description.en_analyzer" to the description stored in that field. Is this way of proceeding possible? The output of the last SET processor is an empty string. I assume this is because the en_analyzer field from the index mapping is empty. My goal is to apply the appropriate analyzer to each of my "classified descriptions". Any ideas how to proceed?
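
For reference, a simulate request along these lines should reproduce the behaviour. It is only a sketch: the sample text is made up, and it assumes the pipeline above has been stored under the name my_sample_pipeline and that the lang_ident_model_1 model is available on the cluster.

POST _ingest/pipeline/my_sample_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "description": "This is a short English sentence used only as sample input."
      }
    }
  ]
}

The response lists the document as it looks after each run of the pipeline, which makes it easy to see what the last SET processor wrote into the description.en field.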
