Indexing Images

IronMike · February 20, 2014, 4:38pm

I am a bit confused about this topic. I would like to index images
(PNGs, JPEGs, GIFs...). My understanding is that I need to extract and index
the text portions of the images; I don't really care about the metadata. So I
looked online and decided to use Apache Tika, which I also use to extract
text from PDFs and index them (PDFs work fine).

How do I get the text part of images? All I am able to extract is
metadata, which I don't need.
Ideally, I want to say: if this image has no text to extract, then
discard/ignore it. Can you please clarify this topic a bit more and provide
any samples if available? Additionally, I don't want to store the
base64-encoded document.

    PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(client.admin().indices())
            .setIndices(INDEX_NAME)
            .setType(INDEX_TYPE)
            .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .startObject(INDEX_TYPE)
                        // I believe this line will not store the whole base64 _source;
                        // below I store only the text portion of the file, in "file"
                        .startObject("_source").field("enabled", "no").endObject()
                        .startObject("properties")
                            .startObject("file")
                                .field("term_vector", "with_positions_offsets")
                                .field("store", "no")
                                .field("type", "attachment")
                                .field("fields")
                                .startObject()
                                    .startObject("file")
                                        .field("store", "yes")
                                    .endObject()
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            ).execute().actionGet();


    public static void testImage(File file) throws IOException, SAXException, TikaException {
        Tika tika = new Tika();
        InputStream inputStream = new BufferedInputStream(new FileInputStream(file));
        Metadata metadata = new Metadata();
        ContentHandler handler = new DefaultHandler();
        Parser parser = new JpegParser();
        ParseContext context = new ParseContext();

        String mimeType = tika.detect(inputStream);
        metadata.set(Metadata.CONTENT_TYPE, mimeType);
        parser.parse(inputStream, handler, metadata, context);

        // Metadata: I don't care for this
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dbfe132a-c25b-40f0-93a7-7957cf978004%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · February 20, 2014, 5:37pm

There is no OCR plugin. I looked for something but did not find anything useful: https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues/10

To be honest, I think it only looked like a good idea: doing OCR inside Elasticsearch nodes does not make sense to me.
This is something that should be done outside Elasticsearch, for example in Logstash if possible.
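
As a rough illustration of doing the extraction outside Elasticsearch, here is a minimal sketch that shells out to an OCR tool from the indexing client. It assumes the Tesseract command-line tool is installed and on the PATH; nothing in this thread confirms that choice, and the class and method names (ExternalOcr, buildCommand, extractText) are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs OCR in a separate process, outside Elasticsearch.
final class ExternalOcr {

    // Builds the command line for the Tesseract CLI (assumed to be installed).
    // Passing "stdout" as the output base makes it print the recognized text
    // instead of writing a file.
    static List<String> buildCommand(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    // Runs the OCR tool and returns the text it printed, trimmed.
    static String extractText(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let diagnostics pass through
                .start();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR tool exited with code " + exitCode);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8).trim();
    }
}
```

The returned string can then be indexed as an ordinary text field, so the Elasticsearch nodes never see the image bytes.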

My 2 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr

IronMike · February 20, 2014, 5:45pm

Thanks David. I agree that OCR, and maybe any kind of text extraction, should
be done before indexing into Elasticsearch. But I am wondering whether Apache
Tika supports this, or whether anyone has experience with a particular tool.
I do plan to do the extraction before indexing.
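
For the "discard if there is no text" part, a minimal sketch of the decision step that could sit between extraction and indexing. It assumes the extracted text arrives as a plain String from whatever tool is used; the class name and the field names file_name and content are hypothetical, not taken from the mapping above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: decides whether an image yielded anything worth indexing.
final class ImageTextDocument {

    // Returns a document holding only the extracted text, or an empty Optional
    // when the text is null, empty, or whitespace-only (the image is skipped).
    static Optional<Map<String, Object>> fromExtractedText(String fileName, String text) {
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty();
        }
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("file_name", fileName);  // hypothetical field name
        doc.put("content", text.trim()); // hypothetical field name
        return Optional.of(doc);
    }
}
```

Because the document carries only the text, the image itself is never base64-encoded or sent to the index; the map can be used as the source of a normal index request against a plain string field.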
