Ingest-attachment can't recognize content type for text encoded in win-1251

Hi.

I'm using ES 6.7.2 with the ingest-attachment plugin.
I have text files (*.txt) in win-1251 encoding that contain Russian characters. After I put the binary data into the index, the documents have no extracted content, because the ingest processor doesn't recognize the content. If there are no Russian characters, everything is OK.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "attachment": {
          "field": "data",
          "target_field": "attachment",
          "properties": [
            "content",
            "content_length",
            "content_type",
            "language"
          ],
          "indexed_chars": -1
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "data": "8PPx"
      }
    }
  ]
}

returns

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "data" : "8PPx",
          "attachment" : {
            "content_type" : "application/octet-stream",
            "content_length" : 0
          }
        },
        "_ingest" : {
          "timestamp" : "2019-05-30T13:08:18.581Z"
        }
      }
    }
  ]
}

In the request, "data": "8PPx" is the text "рус" as win-1251 bytes, encoded in base64.
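
For anyone who wants to reproduce the test value, here is a minimal sketch of how it can be built. The class and method names are made up for illustration, and the Russian text is written with Unicode escapes so the result doesn't depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.util.Base64;

public class Win1251Base64 {
    // Encodes the text as windows-1251 bytes, then as base64
    // (the form the "data" field of the simulate request expects).
    static String encode(String text) {
        byte[] bytes = text.getBytes(Charset.forName("windows-1251"));
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // "\u0440\u0443\u0441" is the three-letter Russian test string from this post
        System.out.println(encode("\u0440\u0443\u0441")); // prints 8PPx
    }
}
```

The three letters become the bytes 0xF0 0xF3 0xF1 in win-1251, which is exactly what base64 turns into "8PPx".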

I tested recognition of the text file with the tika-app.jar command line, and the encoding is detected correctly: "Content-Type: text/plain; charset=windows-1251".

Why doesn't ingest-attachment detect the content type as "text/plain; charset=windows-1251"?

upd: the result is the same for more complex text ("8uXq8fIg4iDq7uTo8O7i6uUgMTI1MQ" -> "текст в кодировке 1251").

It seems the problem is in Tika. It doesn't recognize non-US-ASCII characters as text without hints such as a file extension.

public boolean isMostlyAscii() {
    int control = count(0, 0x20);
    int ascii = count(0x20, 128);
    int safe = countSafeControl();
    return total > 0
            && (control - safe) * 100 < total * 2
            && (ascii + safe) * 100 > total * 90;
}

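But the national (Cyrillic) characters in win-1251 have codes 0xC0-0xFF, so they are counted neither as ASCII nor as safe control characters, and the check fails.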
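
To show why the check above rejects this input, here is a simplified standalone sketch of the same arithmetic on a plain byte array. It is not the Tika class itself: the class name is made up, and the set of "safe" control characters (tab, LF, CR, form feed, ESC) is my assumption about what countSafeControl() counts.

```java
import java.nio.charset.StandardCharsets;

public class MostlyAsciiCheck {
    // Simplified stand-in for the quoted heuristic, working directly on bytes.
    static boolean isMostlyAscii(byte[] data) {
        int total = data.length;
        int control = 0;
        int ascii = 0;
        int safe = 0;
        for (byte raw : data) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (b < 0x20) {
                control++;
                // Assumption: these are the "safe" control characters
                if (b == '\t' || b == '\n' || b == '\r' || b == 0x0C || b == 0x1B) {
                    safe++;
                }
            } else if (b < 128) {
                ascii++;
            }
            // Bytes 0x80-0xFF (including win-1251 Cyrillic) only add to the total.
        }
        return total > 0
                && (control - safe) * 100 < total * 2
                && (ascii + safe) * 100 > total * 90;
    }

    public static void main(String[] args) {
        byte[] cyrillic = {(byte) 0xF0, (byte) 0xF3, (byte) 0xF1}; // decoded "8PPx"
        byte[] latin = "rus".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isMostlyAscii(cyrillic)); // prints false
        System.out.println(isMostlyAscii(latin));    // prints true
    }
}
```

For the three Cyrillic bytes, ascii and safe are both 0, so the last condition (0 > 270) is false. Text has to be more than 90% ASCII to pass.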

If I give the test application a file without an extension, it recognizes the content as "application/octet-stream".
I didn't find any settings that could help me. Maybe I'll try to fix tika-core.jar.
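
One possible client-side workaround, which I have not verified against ES 6.7.2, would be to transcode the file from win-1251 to UTF-8 before base64-encoding it. This only helps if the source encoding is known in advance, and it is an assumption that the attachment processor then detects the UTF-8 bytes as text. A minimal sketch (class and method names are made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TranscodeToUtf8 {
    // Decodes bytes known to be windows-1251 and re-encodes them as UTF-8,
    // then base64-encodes the result for the "data" field.
    static String toUtf8Base64(byte[] win1251Bytes) {
        String text = new String(win1251Bytes, Charset.forName("windows-1251"));
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    public static void main(String[] args) {
        // The same three bytes that "8PPx" decodes to
        byte[] original = {(byte) 0xF0, (byte) 0xF3, (byte) 0xF1};
        System.out.println(toUtf8Base64(original));
    }
}
```

The drawback is that the stored "data" field no longer holds the original file bytes.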
