Default language for language detection

uu99ix · December 20, 2021, 2:20pm

I don't know if this is the right category to post under, sorry if it's not.

I followed the "Language identification" guide (Machine Learning in the Elastic Stack [7.16]), and it seems to return Japanese ("ja") for empty strings and numbers. Is there a setting for a default language? I'd like it to return "en" in these cases.

droberts195 · December 20, 2021, 3:01pm

Thanks for pointing out that deficiency.

I agree it's an issue and have opened elastic/elasticsearch issue #81933, "[ML] What to do about lang_ident for empty strings and numbers?", to discuss what to do.

In the short term, you could add an extra set processor after the lang_ident inference processor that changes the predicted language field to "en" if the source field is an empty string. This set processor would use an if condition so that it only overrides the prediction for empty strings.

uu99ix · December 21, 2021, 1:01am

Nice that you may add it as a feature. I'm not sure if this is the best way or whether it's idiomatic, but I managed to accomplish it with this code as per David's instructions:

{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": {
            "classification": {
              "num_top_classes": 5
            }
          }
        }
      },
      {
        "set": {
          "if": "def t = ctx.text.trim(); t == \"\" || /^(\\d+(\\.\\d+)?)$/.matcher(t).matches()",
          "field": "ml.inference.predicted_value",
          "value": "en"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "123.456"
      }
    }
  ]
}

The result still keeps the model's original prediction probability (the one computed for the overridden language), which is a bit unclean, but I don't think it's important enough to add more processing just for aesthetics.
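
If the leftover model output does matter, one option is a remove processor placed after the set processor and guarded by the same condition. This is only a sketch: the field names below (prediction_probability, prediction_score, top_classes under ml.inference) are assumptions about what the inference processor writes, so check them against the actual _simulate output before relying on it.

```json
{
  "remove": {
    "description": "Sketch only: field names are assumed, verify them against the _simulate output",
    "if": "def t = ctx.text.trim(); t == \"\" || /^(\\d+(\\.\\d+)?)$/.matcher(t).matches()",
    "field": [
      "ml.inference.prediction_probability",
      "ml.inference.prediction_score",
      "ml.inference.top_classes"
    ],
    "ignore_missing": true
  }
}
```

With ignore_missing set to true, the processor skips any of these fields that is not present instead of failing the document, so a wrong guess at a field name does not break the pipeline.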

