Default language for language detection

I'm not sure this is the right category to post under; sorry if it's not.

I followed the guide "Language identification | Machine Learning in the Elastic Stack [7.16] | Elastic", and the model seems to return Japanese ("ja") for empty strings and numbers. Is there a setting for a default language? I'd like it to return "en" in these cases.

Thanks for pointing out that deficiency.

I agree it's an issue and have opened elastic/elasticsearch issue #81933 on GitHub, "[ML] What to do about lang_ident for empty strings and numbers?", to discuss how to handle it.

In the short term, you could add a set processor after the lang_ident inference processor that changes the predicted language field to en when the source field is an empty string. Give this set processor an if condition so that it only overrides the prediction for empty strings.

Nice that you may add it as a feature. I'm not sure whether this is the best or most idiomatic way, but following David's instructions I got it working with this code, which also handles numbers:

{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": {
            "classification": {
              "num_top_classes": 5
            }
          }
        }
      },
      {
        "set": {
          "if": "def t = ctx.text.trim(); t == \"\" || /^(\\d+(\\.\\d+)?)$/.matcher(t).matches()",
          "field": "ml.inference.predicted_value",
          "value": "en"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "123.456"
      }
    }
  ]
}

The overridden result keeps the model's original predicted_probability, which is a bit unclean, but I don't think it's important enough to fix just for aesthetics.
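
If the leftover probability does matter, one possible cleanup is a remove processor placed after the set processor, reusing the same condition. This is only a sketch: the field names listed under ml.inference are assumptions about what the inference processor writes and should be checked against the simulate output. ignore_missing keeps the processor from failing when a listed field is absent.

```json
{
  "remove": {
    "description": "Hypothetical cleanup; the ml.inference field names below are assumptions, verify them against your simulate output",
    "if": "def t = ctx.text.trim(); t == \"\" || /^(\\d+(\\.\\d+)?)$/.matcher(t).matches()",
    "field": [
      "ml.inference.top_classes",
      "ml.inference.prediction_probability",
      "ml.inference.prediction_score"
    ],
    "ignore_missing": true
  }
}
```

This object would go at the end of the "processors" array, so that documents overridden to en no longer carry scores computed for a different language.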
