I am a bit confused about this topic. I would like to index images
(PNGs, JPEGs, GIFs, ...). My understanding is that I need to extract and
index the text portions of the images; I don't really care about the
metadata. So I looked online and decided to use Apache Tika, which I also
use to extract text from PDFs and index them (PDFs work fine).
- How do I get the text part of images? All I am able to extract is
  metadata, which I don't need.
- Ideally, if an image has no text to extract, I want to discard/ignore
  it. Is that possible?
Can you please clarify this topic a bit more and provide any samples if
available? Additionally, I don't want to store the Base64-encoded
document. My mapping and my Tika test code follow.
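
For the "discard if there is no text" part, the decision itself could be as small as the sketch below. It assumes the extracted text is already available as a String per file; the class name TextFilter and its methods are hypothetical and not part of Tika or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides which files are worth indexing,
// based only on the text that was extracted from them.
class TextFilter {
    // True when the extracted text has at least one non-whitespace character.
    static boolean hasIndexableText(String extracted) {
        return extracted != null && !extracted.trim().isEmpty();
    }

    // Returns the names of the files that have indexable text, in input order.
    static List<String> filesToIndex(Map<String, String> extractedByFile) {
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, String> entry : extractedByFile.entrySet()) {
            if (hasIndexableText(entry.getValue())) {
                keep.add(entry.getKey());
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("scan.png", "Invoice 2014"); // sample values, made up
        extracted.put("photo.jpg", "   \n");       // whitespace only
        extracted.put("logo.gif", null);           // nothing extracted
        System.out.println(filesToIndex(extracted)); // prints [scan.png]
    }
}
```

Files that fail the check would simply never be sent to the index.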

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
    .setIndices(INDEX_NAME)
    .setType(INDEX_TYPE)
    .setSource(XContentFactory.jsonBuilder()
        .startObject()
            .startObject(INDEX_TYPE)
                // I believe this line stops the whole Base64 _source from being stored;
                // below I store only the text portion of the file, in "file".
                .startObject("_source").field("enabled", "no").endObject()
                .startObject("properties")
                    .startObject("file")
                        .field("term_vector", "with_positions_offsets")
                        .field("store", "no")
                        .field("type", "attachment")
                        .startObject("fields")
                            .startObject("file")
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    ).execute().actionGet();

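import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.jpeg.JpegParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
        Metadata metadata = new Metadata();
        ContentHandler handler = new DefaultHandler();
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();
        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        parser.parse(inputStream, handler, metadata, context);
        // Metadata only; I don't care about this
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}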
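
One detail in the test method above may matter: Tika parsers report extracted text as SAX events, and a plain DefaultHandler ignores every event, so no text can come out of it whatever the parser does. Below is a minimal sketch of a handler that keeps the character data instead. It uses only the JDK SAX parser on a made-up XHTML string, not Tika itself, and the names TextCollectingHandler and SaxTextDemo are hypothetical. With Tika, org.apache.tika.sax.BodyContentHandler is the usual ready-made handler for this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates character data instead of ignoring it.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // DefaultHandler's own implementation of this method does nothing.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString().trim();
    }
}

class SaxTextDemo {
    // Parses an XML string with the JDK SAX parser and returns its text content.
    static String extractText(String xml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        // Sample input, made up for illustration.
        System.out.println(extractText("<html><body><p>hello world</p></body></html>"));
    }
}
```

Even with a collecting handler, JpegParser reads only metadata, so text that is visible inside the picture would still need an OCR step. Later Tika releases ship a TesseractOCRParser for this, which assumes Tesseract is installed on the machine.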