Elasticsearch Ingest Pipeline + index for language identification and text analysis

yolo · July 17, 2020, 1:17pm

Hi,

I am using an Elasticsearch ingest pipeline for language identification. In addition, I would like to apply a language analyzer to each language-specific field. To my knowledge there is no ingest processor for language analyzers, so I created an index that uses my pipeline as its default pipeline and applies the language analyzers from the mapping.

Here is my index:

PUT my_index
{
  "settings": {
    "index.default_pipeline" : "my_sample_pipeline",
    "analysis" : {
      "analyzer": {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["my_apostrophe"]
        }
      },
      "filter" : {
        "my_apostrophe" : {
          "type" : "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "dynamic": true,
    "properties": {
      "description": {
        "analyzer" : "my_analyzer",
        "type" : "text",
        "fields" : {
          "en_analyzer": {
            "type": "text",
            "analyzer": "english"
          },
          "de_analyzer": {
            "type": "text",
            "analyzer": "simple"
          },
          "pt_analyzer": {
            "type": "text",
            "analyzer": "portuguese"
          },
          "fr_analyzer": {
            "type": "text",
            "analyzer": "french"
          },
          "zh_analyzer": {
            "type": "text",
            "analyzer": "smartcn"
          }
        }
      }
    }
  }
}
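
As a quick check that a given sub-field really picks up its analyzer from this mapping, the _analyze API can be pointed at that field. This is only a sketch: the sample text is made up, and the request assumes the index above was created successfully.

GET my_index/_analyze
{
  "field": "description.en_analyzer",
  "text": "The foxes were jumping over the fences"
}

The response lists the tokens that the analyzer configured for description.en_analyzer produces for the sample text. Swapping in description.fr_analyzer or another sub-field makes the differences between analyzers visible.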

And this is my pipeline:

PUT _ingest/pipeline/my_sample_pipeline
{
  "processors" : [
    {
      "inference" : {
        "model_id" : "lang_ident_model_1",
        "inference_config": {
          "classification" : {
            "num_top_classes" : 1
          }
        },
        "field_map" : {
          "description" : "text"
        },
        "target_field" : "_ml.lang_ident"
      }
    },
    {
      "rename" : {
        "field" : "description",
        "target_field" : "description.raw"
      }
    },
    {
      "rename" : {
        "field" : "_ml.lang_ident.predicted_value",
        "target_field": "description.language_processed"
      }
    },
    {
      "script" : {
        "lang" : "painless",
        "source" : "ctx.description.supported = (['de', 'en', 'fr', 'pt', 'zh'].contains(ctx.description.language_processed))"
      }
    },
    {
      "set" : {
        "if" : "ctx.description.supported",
        "field": "description.{{description.language_processed}}",
        "value" : "{{description.raw}}",
        "override" : false
      }
    },
    {
      "set": {
        "if" : "ctx.description.language_processed == 'en'",
        "field" : "description.{{description.language_processed}}",
        "value" : "{{description.en_analyzer}}"
      }
    }
  ]
}
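
To see what each processor actually writes, the pipeline can be run against a sample document with the simulate API before anything is indexed. This is a minimal sketch: the sample description is made up, and it assumes the pipeline has been stored under the name my_sample_pipeline and that the lang_ident_model_1 model is available on the cluster.

POST _ingest/pipeline/my_sample_pipeline/_simulate?verbose
{
  "docs": [
    {
      "_source": {
        "description": "This is a short English sentence about search engines."
      }
    }
  ]
}

With verbose set, the response shows the document after every processor, which makes it easy to spot the step at which a field ends up empty.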

After storing the description in its language-specific field (for example description.en) using the script and set processors, I am trying to use the last set processor to apply "description.en_analyzer" to the description stored in that field. Is this approach possible? The output of the last set processor is an empty string, presumably because the en_analyzer field from the index mapping holds no value in the document at ingest time. My goal is to apply the appropriate analyzer to each classified description. Any ideas how to proceed?
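
One possible direction, offered as a sketch rather than a confirmed solution: analyzers are applied at index time to whichever mapped field the text lands in, while ingest processors only change the document source. Under that assumption, the language-specific fields that the pipeline writes (description.en, description.de, and so on) could be mapped directly with their analyzers, and the last set processor dropped. The index name my_index_v2 and the types chosen for raw, language_processed and supported are assumptions for illustration; the analyzers are the ones from the mapping above, and smartcn requires the analysis-smartcn plugin.

PUT my_index_v2
{
  "settings": {
    "index.default_pipeline" : "my_sample_pipeline"
  },
  "mappings": {
    "properties": {
      "description": {
        "properties": {
          "raw": { "type": "text" },
          "language_processed": { "type": "keyword" },
          "supported": { "type": "boolean" },
          "en": { "type": "text", "analyzer": "english" },
          "de": { "type": "text", "analyzer": "simple" },
          "pt": { "type": "text", "analyzer": "portuguese" },
          "fr": { "type": "text", "analyzer": "french" },
          "zh": { "type": "text", "analyzer": "smartcn" }
        }
      }
    }
  }
}

Here description is mapped as an object rather than a text field, because the pipeline turns it into an object with raw, language_processed and supported keys. A search against description.en would then use the english analyzer for both the indexed text and the query.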

