Hi.
I'm using Elasticsearch 6.7.2 with the ingest-attachment plugin.
I have text files (*.txt) in windows-1251 encoding that contain Russian characters. After I put the binary data into the index, the indexed documents have no extracted content (no attachment.content field), because the attachment ingest processor does not recognize the content. If the file contains no Russian characters, everything works.
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "attachment": {
          "field": "data",
          "target_field": "attachment",
          "properties": [
            "content",
            "content_length",
            "content_type",
            "language"
          ],
          "indexed_chars": -1
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "data": "8PPx"
      }
    }
  ]
}
returns:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "data" : "8PPx",
          "attachment" : {
            "content_type" : "application/octet-stream",
            "content_length" : 0
          }
        },
        "_ingest" : {
          "timestamp" : "2019-05-30T13:08:18.581Z"
        }
      }
    }
  ]
}
Here "data": "8PPx" is the text "рус" in windows-1251, encoded in base64.
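For anyone who wants to reproduce the payload, here is a minimal Python sketch of how the "data" value can be built and checked. It assumes the file bytes are windows-1251 (Python codec name cp1251); the helper name to_payload is hypothetical and used only for illustration.

```python
import base64


def to_payload(text: str, encoding: str = "cp1251") -> str:
    """Encode text in the given charset, then base64 it for the "data" field.

    Hypothetical helper; cp1251 is Python's name for windows-1251.
    """
    return base64.b64encode(text.encode(encoding)).decode("ascii")


payload = to_payload("рус")
raw = base64.b64decode(payload)

print(payload)               # 8PPx
print(raw.hex())             # f0f3f1 (three single-byte cp1251 characters)
print(raw.decode("cp1251"))  # рус
```

The three bytes f0 f3 f1 are all above 0x7f, so the payload contains no ASCII characters at all.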
I tested the same text file with the tika-app.jar command line, and the encoding is detected correctly: "Content-Type: text/plain; charset=windows-1251".
Why doesn't ingest-attachment detect the content type as "text/plain; charset=windows-1251"?
Update: the same happens with longer text: "8uXq8fIg4iDq7uTo8O7i6uUgMTI1MQ" -> "текст в кодировке 1251" ("text in 1251 encoding").
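The longer sample can be checked the same way. This is a small Python sketch, assuming windows-1251 (cp1251) bytes; decode_payload is a hypothetical helper name. The string as pasted has no trailing "=" padding, so the sketch restores it before decoding.

```python
import base64


def decode_payload(b64: str, encoding: str = "cp1251") -> str:
    """Restore any missing '=' padding, base64-decode, then decode the bytes.

    Hypothetical helper; assumes the original bytes are windows-1251.
    """
    padded = b64 + "=" * (-len(b64) % 4)
    return base64.b64decode(padded).decode(encoding)


sample = "8uXq8fIg4iDq7uTo8O7i6uUgMTI1MQ"
print(decode_payload(sample))  # текст в кодировке 1251
```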