Ingest-attachment can't recognize content type for text encoded in win-1251

aledyad · May 30, 2019, 1:23pm

Hi.

I'm using ES 6.7.2 with the ingest-attachment plugin.
I have text files (*.txt) in windows-1251 encoding that contain Russian characters. After I put the binary data into the index, the indexed documents have no extracted content, because the ingest processor doesn't recognize the content. If the file contains no Russian characters, everything is OK.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "attachment": {
          "field": "data",
          "target_field": "attachment",
          "properties": [
            "content",
            "content_length",
            "content_type",
            "language"
          ],
          "indexed_chars": -1
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "data": "8PPx"
      }
    }
  ]
}

returns

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "data" : "8PPx",
          "attachment" : {
            "content_type" : "application/octet-stream",
            "content_length" : 0
          }
        },
        "_ingest" : {
          "timestamp" : "2019-05-30T13:08:18.581Z"
        }
      }
    }
  ]
}

"data": "8PPx"
Here "8PPx" is the text "рус" in windows-1251 (bytes 0xF0 0xF3 0xF1), encoded in base64.

I tested recognition of the text file with the tika-app.jar command line, and the encoding is detected correctly: "Content-Type: text/plain; charset=windows-1251".

Why doesn't ingest-attachment detect the content type as "text/plain; charset=windows-1251"?

Update: the same happens with longer text, e.g. "8uXq8fIg4iDq7uTo8O7i6uUgMTI1MQ" ("текст в кодировке 1251").
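
For anyone who wants to check the two payloads quoted above, here is a minimal Python sketch that decodes them as windows-1251. The helper name `decode_payload` is made up for illustration, and the padding step is an assumption on my side: the longer string appears without its trailing "=" characters, so the sketch restores them before decoding.

```python
import base64


def decode_payload(b64_text: str, encoding: str = "cp1251") -> str:
    """Decode a base64 string and interpret the bytes in the given encoding."""
    # Assumption: any missing "=" padding was simply lost when the string was copied.
    padded = b64_text + "=" * (-len(b64_text) % 4)
    return base64.b64decode(padded).decode(encoding)


short_text = decode_payload("8PPx")
long_text = decode_payload("8uXq8fIg4iDq7uTo8O7i6uUgMTI1MQ")

print(short_text)  # рус
print(long_text)   # текст в кодировке 1251
```

Going the other way, `base64.b64encode("рус".encode("cp1251"))` gives back `b"8PPx"`, so the payloads really are plain windows-1251 text and nothing else.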

aledyad · June 5, 2019, 12:15pm

It seems the problem is in Tika. It doesn't recognize non-US-ASCII characters as text without a hint such as a file extension.

public boolean isMostlyAscii() {
    int control = count(0, 0x20);
    int ascii = count(0x20, 128);
    int safe = countSafeControl();
    return total > 0
            && (control - safe) * 100 < total * 2
            && (ascii + safe) * 100 > total * 90;
}

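But the national (Cyrillic) characters in windows-1251 have codes 0xC0-0xFF, so they are counted neither as ASCII nor as safe control characters, and the 90% threshold is not reached.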
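
To see the effect on the actual payload, here is a rough Python port of the check quoted above. It is only a sketch and not Tika's code: the function name `is_mostly_ascii` is mine, and the set of "safe" control characters (tab, LF, CR, form feed, escape) is my assumption about what `countSafeControl()` counts.

```python
# Assumption: the control bytes that countSafeControl() treats as harmless.
SAFE_CONTROL = {0x09, 0x0A, 0x0D, 0x0C, 0x1B}


def is_mostly_ascii(data: bytes) -> bool:
    """Approximate port of the isMostlyAscii() heuristic quoted above."""
    total = len(data)
    control = sum(1 for b in data if b < 0x20)         # count(0, 0x20)
    ascii_ = sum(1 for b in data if 0x20 <= b < 0x80)  # count(0x20, 128)
    safe = sum(1 for b in data if b in SAFE_CONTROL)
    return (
        total > 0
        and (control - safe) * 100 < total * 2   # under 2% unsafe control bytes
        and (ascii_ + safe) * 100 > total * 90   # over 90% ASCII or safe control
    )


latin = b"rus"
cyrillic = "рус".encode("cp1251")  # bytes F0 F3 F1, same as base64 "8PPx"

print(is_mostly_ascii(latin))     # True
print(is_mostly_ascii(cyrillic))  # False
```

With the three bytes F0 F3 F1 the ASCII count is 0, so the last condition fails and the content is not treated as text.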

If I give the test application a file without an extension, it also recognizes the content as "application/octet-stream".
I haven't found any settings that could help me. Maybe I'll try to fix tika-core.jar.

system · July 3, 2019, 12:15pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
