Can't search the content of .txt files


(Zikezi) #1

Hello, I am using Elasticsearch 2.1.2 with the mapper-attachments plugin 3.1.2 on 64-bit JDK 1.7. All configuration is at its defaults; nothing has been changed. I have run into a problem: when I upload .txt files, Elasticsearch can't search their content, but MS Office files work fine. Here is my complete indexing information:
1> The index mapping

{
  "mappings": {
    "lesson": {
      "properties": {
        "sysOrganizationSid": { "type": "long" },
        "atts": {
          "properties": {
            "file": {
              "type": "attachment",
              "fields": {
                "content": { "type": "string" },
                "author": { "type": "string" },
                "title": { "type": "string" },
                "keywords": { "type": "string" },
                "name": { "type": "string" },
                "language": { "type": "string" },
                "content_length": { "type": "integer" },
                "date": { "format": "strict_date_optional_time||epoch_millis", "type": "date" },
                "content_type": { "type": "string" }
              }
            },
            "name": { "type": "string" },
            "cname": { "type": "string" }
          }
        }
      }
    }
  }
}
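For reference, a mapping like this is applied when the index is created. Here is a sketch of that request in the same style as the query further below, trimmed to just the attachment field; it assumes the xyinfo index does not exist yet and that the mapper-attachments plugin is installed on every node:

```
PUT http://127.0.0.1:9200/xyinfo
{
  "mappings": {
    "lesson": {
      "properties": {
        "atts": {
          "properties": {
            "file": { "type": "attachment" }
          }
        }
      }
    }
  }
}
```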

2> Creating the index

  • Java code used to build the Base64 string

public static String file2Base64(String path) throws Exception {
    File file = new File(path);
    FileInputStream inputFile = new FileInputStream(file);
    byte[] buffer = new byte[(int) file.length()];
    inputFile.read(buffer);
    inputFile.close();
    return new BASE64Encoder().encode(buffer);
}
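As a side note, sun.misc.BASE64Encoder is an internal JDK API, and it inserts a line break every 76 characters, which is why the encoded string in the data below contains whitespace. On Java 8 and later (not the JDK 1.7 used here, so this would only apply after an upgrade), the standard java.util.Base64 produces a single unbroken string. A minimal sketch:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileToBase64 {
    // Reads the whole file and encodes it without line breaks,
    // unlike sun.misc.BASE64Encoder, which wraps output at 76 characters.
    public static String file2Base64(String path) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return java.util.Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical demo file, created just for illustration.
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello".getBytes("UTF-8"));
        System.out.println(file2Base64(tmp.toString())); // prints aGVsbG8=
        Files.delete(tmp);
    }
}
```

Files.readAllBytes also avoids the original code's unchecked InputStream.read call, which is not guaranteed to fill the whole buffer in one pass.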

  • Final data for the index:

{
  "_index": "xyinfo",
  "_type": "lesson",
  "_id": "1030",
  "_version": 1,
  "_score": 1,
  "_source": {
    "platOrganizationSid": 1001,
    "atts": [
      {
        "cname": "新建文本文档 (3).txt",
"file": "1NrJz9K7xqq52NPaTHVjZW5ltcSyqb/N1tCjrL3pydzBy0x1Y2VuZbXEyOvDxdLUvLDW0M7EzsS8 /rXEy9HL987KzOK94r72o6zG5NbQyrnTw7XEwP3X08rH0tTOxLG+zsS8/tf3zqrL0cv3tcTOxLW1 o6zU2rS0vajL99L9yrHKudPDwctSZWFkZXK2wcihzsS8/qGjtavKx9TayrW8yrXE06bTw7n9s8zW 0KOsvq2zo9Do0qq21Lj31ta499H5tcS3x87Esb7OxLz+tcTE2sjdvfjQ0MirzsTL0cv3oaPO0sPH s6PTw7XEs/3By3R4dKGiaHRtbKGieG1stcjOxLG+uPHKvbXEzsS1tc3io6y7udPQtPPBv7XEwP3I 53BkZqGid29yZKGicHB0tci3x87Esb648cq9tcTOxLW1o6zU2rbU1eLQqc7EtbXW0LXExNrI3b34 0NDL0cv3yrGjrL7N0OjSqs/Is+nIocbkxNrI3aOsyLu689TZuPq9+MbkxNrI3bS0vajL99L9o6yy xcTcsbvV/ci3y9HL96GjQXBhY2hlIFRpa2Egvs3Kx9K7v+7Hv7TztcTOxLW1xNrI3bPpyKG/8rzc o6zL/Lyvs8nBy7j31tbOxLW1veLO9sb3o6zE3Lm7yrax8LTztuDK/bXEzsS1taOssqLH0sTcubvA qdW5xuTL+7XEveLO9sb3o6y2+MfSttTW0M7EtcTKtrHw0rK9z7rDoaO+rbn9srvN6sirsuLK1KOs xNy5u8q2sfC1xM7EtbW48cq9yOfPwqO6DQpwZGbOxLW1DQpkb2OhomRvY3ihonBwdKGiZXhjZWwN CnR4dKGiaHRtbKGieG1sDQp6aXChonRhcg0K0tTJzzTA4M7EtbW7+bG+yc+w/LqswcvO0sPH1Nq0 tL2o0ru49tfKwc+/4sqxy/nKudPDtcTOxLW1uPHKvaGj"
      }
    ]
  }
}

3> This is my query (the term 博客 means "blog" and does appear in the file content):

POST http://127.0.0.1:9200/xyinfo/lesson/_search
{
  "query": {
    "match": {
      "atts.file.content": "博客"
    }
  }
}

This is my .txt file's content (a Chinese passage about using Apache Tika to extract text from non-text documents for Lucene indexing):

在上一篇关于Lucene的博客中,介绍了Lucene的入门以及中文文件的搜索问题解决,其中使用的例子是以文本文件作为搜索的文档,在创建索引时使用了Reader读取文件。但是在实际的应用过程中,经常需要对各种各样的非文本文件的内容进行全文搜索。我们常用的除了txt、html、xml等文本格式的文档外,还有大量的例如pdf、word、ppt等非文本格式的文档,在对这些文档中的内容进行搜索时,就需要先抽取其内容,然后再跟进其内容创建索引,才能被正确搜索。Apache Tika 就是一款强大的文档内容抽取框架,它集成了各种文档解析器,能够识别大多数的文档,并且能够扩展其他的解析器,而且对中文的识别也较好。经过不完全测试,能够识别的文档格式如下:
pdf文档
doc、docx、ppt、excel
txt、html、xml
zip、tar
以上4类文档基本上包含了我们在创建一个资料库时所使用的文档格式。

This is the result of the search: no hits are returned for the .txt content (result screenshot not included).

Where did I go wrong? Please correct me, thanks!


(David Pilato) #2

I wonder if it could be caused by an encoding issue. I mean that Elasticsearch assumes all characters are UTF-8 encoded. Might that not be the case for your TXT file?
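If that is the cause, one fix is to convert the file's bytes to UTF-8 before Base64-encoding them. A minimal sketch, assuming the .txt file was saved in GBK (a common default for Chinese text files on Windows; the actual charset of the file is an assumption here):

```java
import java.nio.charset.Charset;

public class ToUtf8 {
    // Decodes the raw file bytes with the source charset (assumed GBK here)
    // and re-encodes the text as UTF-8, which is what Elasticsearch expects.
    public static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        byte[] gbkBytes = "博客".getBytes("GBK");        // simulate a GBK-saved file
        byte[] utf8Bytes = toUtf8(gbkBytes, "GBK");
        System.out.println(new String(utf8Bytes, "UTF-8")); // prints 博客
    }
}
```

The resulting UTF-8 bytes would then be passed to the Base64 encoding step before indexing. Note that this only works if the source charset is known or detected correctly; decoding GBK bytes as UTF-8 (or vice versa) silently produces garbage characters.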

