Indexing Images


(IronMike) #1

I am a bit confused about this topic. I would like to index images
(PNG, JPEG, GIF...). My understanding is that I need to extract and index
the text portions of the images; I don't really care about the metadata. So I
looked online and decided to use Apache Tika, which I already use to extract
text from PDFs and index them (PDFs work fine).

  • How do I get the text part of images? All I am able to extract is
    metadata, which I don't need.
  • Ideally, if an image has no text to extract, I want to discard/ignore
    it. Can you please clarify this topic a bit more and provide any samples
    if available? Additionally, I don't want to store the base64-encoded
    document.

PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
        .setIndices(INDEX_NAME)
        .setType(INDEX_TYPE)
        .setSource(XContentFactory.jsonBuilder()
            .startObject()
                .startObject(INDEX_TYPE)
                    // I believe this line keeps the whole base64 _source from being stored;
                    // below, only the text portion of the file is stored, in "file".
                    .startObject("_source").field("enabled", "no").endObject()
                    .startObject("properties")
                        .startObject("file")
                            .field("term_vector", "with_positions_offsets")
                            .field("store", "no")
                            .field("type", "attachment")
                            .startObject("fields")
                                .startObject("file")
                                    .field("store", "yes")
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject())
        .execute().actionGet();


public static void testImage(File file) throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
        Metadata metadata = new Metadata();
        // DefaultHandler ignores all content events, so only metadata is collected
        ContentHandler handler = new DefaultHandler();
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();

        // detect() peeks at the first bytes and resets the buffered stream
        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        parser.parse(inputStream, handler, metadata, context);

        // Metadata only; I don't care for this
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dbfe132a-c25b-40f0-93a7-7957cf978004%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

There is no OCR plugin. I looked for something but did not find anything useful: https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues/10

To be honest, I think it only looked like a good idea. Doing OCR inside Elasticsearch nodes does not make sense to me.
It should be done outside Elasticsearch, for example in Logstash if possible.
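
As a rough illustration of doing the work outside Elasticsearch, here is a minimal sketch of a pre-indexing step. It assumes some external OCR step has already produced a string for each image. The class `ImageTextDocuments` and the field names `name` and `text` are hypothetical and are not part of Elasticsearch or Tika. The sketch only decides whether an image is worth indexing and builds a document that holds the extracted text, never the base64-encoded image.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: none of these names come from Elasticsearch or Tika.
final class ImageTextDocuments {

    private ImageTextDocuments() {}

    /**
     * Builds the document to index from text produced by an external OCR step.
     * Returns an empty Optional when there is no text, so the caller can
     * discard the image instead of indexing it.
     */
    static Optional<Map<String, String>> toDocument(String fileName, String ocrText) {
        if (ocrText == null || ocrText.trim().isEmpty()) {
            return Optional.empty(); // no text found: skip this image
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", fileName);
        // Only the extracted text is kept; the image bytes are never base64-encoded.
        doc.put("text", ocrText.trim());
        return Optional.of(doc);
    }

    public static void main(String[] args) {
        // An image with recognizable text yields a document to index.
        System.out.println(toDocument("scan.png", "  Invoice 42 \n"));
        // An image with no text yields nothing, so it would be skipped.
        System.out.println(toDocument("photo.jpg", "   "));
    }
}
```

With a document like this, the text could probably be indexed as an ordinary string field rather than an attachment field, since nothing is left for the attachment mapper to extract.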

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

In reply to ZenMaster80's message of February 20, 2014, 17:39:05.



(IronMike) #3

Thanks David. I agree that OCR, and maybe any kind of text extraction, should
be done before indexing into Elasticsearch. But I am just wondering whether
Apache Tika supports this, or whether anyone has experience with a particular
tool. I do plan to extract before indexing.
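
One tool that might fit the extract-before-indexing step is the open-source Tesseract OCR engine, called as an external process. The sketch below is only an illustration under stated assumptions: it assumes the `tesseract` command-line tool is installed and on the PATH, and it relies on Tesseract's convention of writing to standard output when the output base name is `stdout`. The class `TesseractCli` is hypothetical and is not part of Tika or Elasticsearch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the Tesseract command-line tool.
final class TesseractCli {

    private TesseractCli() {}

    /** Command line asking Tesseract to print the recognized text to stdout. */
    static List<String> command(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    /** Runs Tesseract on one image and returns the recognized text, trimmed. */
    static String extractText(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image))
                // Send diagnostics to our own stderr so they do not mix with the text.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TesseractCli <image-file>");
            return;
        }
        String text = extractText(Paths.get(args[0]));
        // An empty result is the signal to skip the image instead of indexing it.
        System.out.println(text.isEmpty() ? "(no text found)" : text);
    }
}
```

An empty string from `extractText` would be the cue to discard the image; anything else could be sent to the index as plain text.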


