I am a bit confused about this topic. I would like to index images
(PNGs, JPEGs, GIFs, ...). My understanding is that I need to extract and
index the text portions of the images; I don't really care about the
metadata. So I looked online and decided to use Apache Tika, which I also
use to extract text from PDFs and index them (PDFs work fine).
- How do I get the text part of images? All I am able to extract is
  metadata, which I don't need.
- Ideally I want to say: if this image has no text to extract, then
  discard/ignore it.

Can you please clarify this topic a bit more and provide any samples if
available? Additionally, I don't want to store the base64-encoded
document.

Here is my mapping:

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
            .startObject(INDEX_TYPE)
                // I believe this line stops the whole base64 _source from being
                // stored; below I store only the text portion of the file, in "file".
                .startObject("_source").field("enabled", "no").endObject()
                .startObject("properties")
                    .startObject("file")
                        .field("term_vector", "with_positions_offsets")
                        .field("store", "no")
                        .field("type", "attachment")
                        .startObject("fields")
                            .startObject("file")
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();

And here is my Tika test method:

public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    // try-with-resources so the stream is closed even if parsing fails
    try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
        Metadata metadata = new Metadata();
        ContentHandler handler = new DefaultHandler();
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();
        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        parser.parse(inputStream, handler, metadata, context);
        // Metadata: I don't care about this
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}
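
To make the second question more concrete, this is roughly the check I have in mind once I can actually get text out of an image. It is only a sketch: `ImageTextFilter` and `indexableText` are names I made up (they are not part of Tika or Elasticsearch), and I am assuming the extracted text arrives as a plain `String`.

```java
import java.util.Optional;

// Hypothetical helper, not part of Tika or Elasticsearch.
final class ImageTextFilter {
    private ImageTextFilter() {}

    /**
     * Returns the trimmed text if it contains at least one non-whitespace
     * character; otherwise returns an empty Optional, meaning the image
     * should be discarded instead of indexed.
     */
    static Optional<String> indexableText(String extracted) {
        if (extracted == null) {
            return Optional.empty();
        }
        String trimmed = extracted.trim();
        return trimmed.isEmpty() ? Optional.<String>empty() : Optional.of(trimmed);
    }
}
```

The idea would be to call `ImageTextFilter.indexableText(...)` on whatever string the content handler collected and to send the index request only when the result is present.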