Indexing binary

IronMike · February 26, 2014, 11:20pm

I index PDFs using apache with the following mapping.

    .field("type", "attachment")
    .field("fields")
    .startObject()
        .startObject("file")
            .field("store", "yes")
        .endObject()

I also want to index photos. I am able to extract their text using OCR, but
I am confused about how to index that text. Do I treat it like any other
document and not as an attachment? The extracted text is a plain "String",
not base64 as in the case of PDFs.
I am also confused about how it gets stored and how it works if I need to
make it available during search. Can someone explain how to do this?

    XContentFactory.jsonBuilder().startObject()
        .startObject(INDEX_TYPE)
            // This line keeps the whole base64 _source from being stored
            .startObject("_source").field("enabled", "no").endObject()
            .startObject("properties")

So my photo object becomes something like this. What about the source (the
image itself)?

jsonObject

    {
      "content": "text extracted from image",
      "name": "my_photo.png"
    }

    // add to the bulk indexer for indexing
    bulkProcessor.add(Requests.indexRequest(INDEX_NAME).type(INDEX_TYPE)
        .id(jsonObject.getString("name")).source(jsonObject.toString()));

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2012d7c6-b499-4318-8ae7-512879e5e8b8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Binh_Ly_2 · February 27, 2014, 1:29pm

You certainly can add a new field, and then just put the OCR text into that
new field. So for example:

Mapping:

    PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(
            client.admin().indices()).setIndices(INDEX_NAME).setType(DOCUMENT_TYPE).setSource(
        XContentFactory.jsonBuilder().startObject()
            .field(DOCUMENT_TYPE).startObject()
                .field("properties").startObject()
                    .field("text").startObject()
                        .field("type", "string")
                    .endObject()
                    .field("file").startObject()
                        .field("store", "yes")
                        .field("type", "attachment")
                        .field("fields").startObject()
                            .field("file").startObject()
                                .field("store", "yes")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    ).execute().actionGet();

Then put the OCR text into the "text" field:

    IndexResponse indexResponse = client.prepareIndex(INDEX_NAME, DOCUMENT_TYPE, "1")
        .setSource(XContentFactory.jsonBuilder().startObject()
            .field("text", ocrText)
            .field("file").startObject()
                .field("content", fileContents)
                .field("_indexed_chars", -1)
            .endObject()
        .endObject()
    ).execute().actionGet();

You probably don't need to index the image binary information - not sure
what you would need it for.
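
One way to read this advice, as a rough sketch only: keep the image file outside the index and index just the OCR text plus the file name, much like the "name" field in the original post. The class and method names below (PhotoDocument, photoJson) are made up for illustration and are not part of the Elasticsearch API; the sketch only builds the JSON body you would hand to setSource(...) or to the bulk processor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Elasticsearch: builds the JSON body for a
// photo document that carries only the OCR text and the image's file name.
class PhotoDocument {

    // Escapes a value so it can sit inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Serializes the fields in insertion order as a flat JSON object of strings.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    static String photoJson(String name, String ocrText) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("text", ocrText); // searchable OCR text
        fields.put("name", name);    // assumption: enough to locate the image outside the index
        return toJson(fields);
    }

    public static void main(String[] args) {
        System.out.println(photoJson("my_photo.png", "text extracted from image"));
    }
}
```

With a document like this, a match on "text" gives back the "name", and the application can load the image from wherever it is kept. Whether that fits depends on where the images actually live.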


IronMike · February 27, 2014, 5:54pm

Thanks. It sounds like you are treating it as an attachment. In your
example, what is "fileContents" in .field("content", fileContents)?
How do I get the file contents of an image? I know that in the case of the
PDF this is the content text of the PDF.
Correct, I don't want to index the image binary. I just need to be able to
pull up the image when its text field has a match.


Binh_Ly_2 · February 27, 2014, 6:16pm

Oh, the attachment part is for your PDF. If you don't need to index PDFs
then just remove that part:

    PutMappingResponse putMappingResponse = new PutMappingRequestBuilder(
            client.admin().indices()).setIndices(INDEX_NAME).setType(DOCUMENT_TYPE).setSource(
        XContentFactory.jsonBuilder().startObject()
            .field(DOCUMENT_TYPE).startObject()
                .field("properties").startObject()
                    .field("text").startObject()
                        .field("type", "string")
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    ).execute().actionGet();

Indexing:

    IndexResponse indexResponse = client.prepareIndex(INDEX_NAME, DOCUMENT_TYPE, "1")
        .setSource(XContentFactory.jsonBuilder().startObject()
            .field("text", ocrText)
        .endObject()
    ).execute().actionGet();


IronMike · February 27, 2014, 6:18pm

Binh, thanks. With your help I think I am closer to the answer. With the
sample mapping you provided, I should be able to provide the base64
contents of the image file as the "content" field and the OCR text as the
"text" field. So, when the OCR text is searched, I can return the "content",
which is the image. With the above mapping I believe the image is saved in
the _source as well as in the field, for "highlighting" purposes. Can I
prevent it from being stored in _source with something like this?

startObject("_source").field("enabled","no").endObject()
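
For reference, the base64 string discussed here can be produced with the JDK alone (java.util.Base64, Java 8 and later). This is a minimal sketch, not anything from the attachment plugin: the class name ImageBase64 and its methods are hypothetical, and the main method writes a tiny throwaway file just to have something to encode. Note that base64 emits 4 characters for every 3 bytes, so the encoded value is roughly a third larger than the image itself, which matters if it ends up in _source or in a stored field.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for illustration; the names are not from the thread.
class ImageBase64 {

    // Encodes raw bytes as a standard (non-URL-safe) base64 string.
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reads a file fully into memory and encodes it.
    static String encodeFile(Path path) throws IOException {
        return encode(Files.readAllBytes(path));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real image: a temp file holding four bytes.
        Path tmp = Files.createTempFile("photo", ".png");
        try {
            Files.write(tmp, new byte[] {(byte) 0x89, 'P', 'N', 'G'});
            String fileContents = encodeFile(tmp);
            System.out.println(fileContents);

            // Decoding gives back the original bytes.
            byte[] decoded = Base64.getDecoder().decode(fileContents);
            System.out.println(decoded.length);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The resulting string is what would go into a field such as "content" in the examples above, if the image is to be kept in the index at all.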


IronMike · February 27, 2014, 7:14pm

Sorry for the confusion - I do want PDFs, but I am concerned with the
retrieval of the image file when its OCR text is searched. I must be missing
something.
As shown below, I provide two fields, "text" and "content". In your
second post you say I don't need the "content" field for images? So how
does the search return the image to the requesting client (a web app, for
instance) when a text match occurs on the image's OCR text? If I only
include "text", then it will return only the text part of the image and not
the image, correct?

    source(XContentFactory.jsonBuilder()
        .startObject()
            .field("text", ocrText)   // extracted OCR text from image
            .field("file").startObject()
                // content is the base64-encoded string of the image file? Is it needed?
                .field("content", fileContents)
                .field("_indexed_chars", -1)
            .endObject()
        .endObject()
    )


Binh_Ly_2 · February 27, 2014, 9:42pm

When you do a search, it will return your full _source document by default.
If you supplied a value for the text field at index time, then the text
field is included in the returned _source. If you supply some other field
at index time, then that field will also be returned from the _source. The
best way to try this is to actually work with the REST API. It allows you
to quickly and interactively test things before you write your Java code
and also it gives you a better understanding of how your Java code would
behave.
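
To make the REST suggestion concrete, here is a small sketch of the request you could try by hand, for example with curl against a node assumed to be at localhost:9200. It only builds the path and the body of a match query on the "text" field; it does not talk to a cluster. The class name OcrSearchRequest, the index name "photos" and the type name "photo" are placeholders, not names from this thread.

```java
// Hypothetical helper for illustration; names are not from the thread.
class OcrSearchRequest {

    // Escapes backslashes and double quotes only; the query text is assumed
    // to contain no control characters.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Path of the search endpoint for one index and one type.
    static String searchPath(String index, String type) {
        return "/" + index + "/" + type + "/_search";
    }

    // Body of a match query against the "text" field that holds the OCR text.
    static String matchBody(String queryText) {
        return "{\"query\":{\"match\":{\"text\":\"" + escape(queryText) + "\"}}}";
    }

    public static void main(String[] args) {
        // Placeholder index and type names.
        System.out.println("POST " + searchPath("photos", "photo"));
        System.out.println(matchBody("invoice"));
    }
}
```

Each hit in the response should carry a _source holding whatever fields were supplied at index time, provided _source has not been disabled in the mapping.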
