Ingest pipeline for text analysis?

Hi community,

Is it possible to apply text analysis to a search index using ingest pipelines? I have not found a processor suited to this task. The idea is to detect the language with the inference processor in an ingest pipeline and then apply language-specific analyzers to the text. Is this even possible with Elasticsearch?

Have a look at https://www.elastic.co/blog/multilingual-search-using-language-identification-in-elasticsearch

Yes, I have seen it. But I am still not sure whether, after language identification, it is possible to use a built-in language analyzer for each field. Somehow this is not working, not even with the example from that blog post. I used a series of processors to store the input field in a language-specific field.
I am now struggling with how to apply language analyzers to those fields. Is it even possible in ES?

  • The language fields are defined in the pipeline and the language analyzers in the mapping.

Isn't that what the "Per-Field" section is about?

I used the ingest pipeline processors to create one field per language. Now I would like to apply language analyzers to those fields, but somehow it is not working: the text is not analyzed.

My index:

{
  "settings": {
    "index.default_pipeline" : "my_sample_pipeline",
    "analysis" : {
      "analyzer": {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard"
        }
      }
    }
  },
  "mappings": {
    "dynamic": true,
    "properties": {
      "description": {
        "analyzer" : "my_analyzer",
        "type" : "text",
        "properties" : {
          "en_analyzer": {
            "type": "text",
            "analyzer": "english"
          },
          "de_analyzer": {
            "type": "text",
            "analyzer": "simple"
          },
          "fr_analyzer": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}

How do you know?

I know this from my pipeline. I used a second set processor to assign a language analyzer to each per-language field. But I am not sure whether I can use my analyzers in the pipeline.

My pipeline:

PUT _ingest/pipeline/my_sample_pipeline
{
  "processors" : [
    {
      "inference" : {
        "model_id" : "lang_ident_model_1",
        "inference_config" : {
          "classification" : {
            "num_top_classes" : 1
          }
        },
        "field_map" : {
          "description" : "text"
        },
        "target_field" : "_ml.lang_ident"
      }
    },
    {
      "rename" : {
        "field" : "description",
        "target_field" : "description.raw"
      }
    },
    {
      "rename" : {
        "field" : "_ml.lang_ident.predicted_value",
        "target_field" : "description.language_processed"
      }
    },
    {
      "script" : {
        "lang" : "painless",
        "source" : "ctx.description.supported = (['de', 'en', 'fr', 'pt', 'zh'].contains(ctx.description.language_processed))"
      }
    },
    {
      "set" : {
        "if" : "ctx.description.supported",
        "field" : "description.{{description.language_processed}}",
        "value" : "{{description.raw}}",
        "override" : false
      }
    },
    {
      "set" : {
        "if" : "ctx.description.language == 'en'",
        "field" : "description.{{description.language_processed}}",
        "value" : "{{description.en_analyzer}}"
      }
    }
  ]
}

Once the field has been set, the right analyzer for that field will be used. I don't understand what you're talking about.
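
To make that concrete, here is a minimal mapping sketch under which the fields written by the pipeline would pick up language analyzers. It assumes `description` is mapped as an object, since the pipeline writes `description.raw`, `description.language_processed`, `description.supported` and `description.<language code>`, and that the sub-field names match the codes the first set processor produces (`en`, `de`, `fr`). The index name `my-index` is hypothetical, and `german` for `de` is my assumption (the mapping posted above uses `simple`). This would be sent as the body of `PUT my-index`:

```json
{
  "settings": {
    "index.default_pipeline": "my_sample_pipeline"
  },
  "mappings": {
    "properties": {
      "description": {
        "properties": {
          "raw": { "type": "text", "analyzer": "standard" },
          "language_processed": { "type": "keyword" },
          "supported": { "type": "boolean" },
          "en": { "type": "text", "analyzer": "english" },
          "de": { "type": "text", "analyzer": "german" },
          "fr": { "type": "text", "analyzer": "french" }
        }
      }
    }
  }
}
```

Note that sub-fields of an object go under `properties`, while `fields` is for multi-fields of a single text field. With this layout no second set processor is needed: the analyzer is chosen by the mapping of whichever field the pipeline writes to. Languages without an explicit entry here, such as `pt` and `zh` from the script's list, would fall back to dynamic mapping and the default analyzer.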

Is there any way to store the analyzed text in a field and see what the text looks like after pre-processing? I know about the Analyze API, but I would like to see it for the whole dataset. I thought an ingest pipeline would allow this.

No. The analyze API is built for that need.
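
As a rough sketch of how that looks for one of the per-language fields: the request below would be the body of `GET my-index/_analyze`, where `my-index` is a hypothetical index name and `description.en` is assumed to be mapped with the `english` analyzer. The sample text is made up.

```json
{
  "field": "description.en",
  "text": "The quick foxes were jumping over the fences"
}
```

The response lists the tokens that this field's analyzer produces for the given text, which is what ends up in the inverted index. The stored `_source` always keeps the original text, not the analyzed form.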

When using language analyzers in the index mapping, are the analyzers applied to the indexed text at index time (my datasets) or to the query text at search time? Thank you!

Both.
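
A small illustration of the search-time half, assuming a hypothetical index `my-index` in which `description.en` is mapped with the `english` analyzer. Sent as the body of `GET my-index/_search`, a full-text query such as `match` runs the query text through the field's analyzer before looking up terms:

```json
{
  "query": {
    "match": {
      "description.en": "jumping foxes"
    }
  }
}
```

The same analyzer is used on both sides unless the mapping sets a separate `search_analyzer` for the field or the query passes its own `analyzer` parameter.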
