Problems indexing attachments using attachment mapping

Massimiliano_Peranto · May 12, 2013, 7:59pm

Hi,
I installed Elasticsearch without trouble and started developing a mail
indexing solution.
The main setup went smoothly, and I also installed the plugin for Tika
document text extraction.
I then wrote some simple beans that parse emails with JavaMail and write
them into the system.
When it comes to indexing attachments (doc, pdf, docx, OpenDocument files,
etc.), several mails are indexed correctly, but others are not.
I had some problems passing the Base64-encoded documents straight from the
email, so I chose to decode the contents and re-encode them, just to be
sure I wrote everything correctly.
When I create the JSON file (attached to this email), I can also write out
the decoded document, which is readable, so the payload I pass to
Elasticsearch appears to be valid.
Here are the versions:

elasticsearch version 0.90.0
elasticsearch-mapper-attachments 1.7.0

See the attached JSON file for a test document.
Here's the mapping I used:
curl -XGET 'http://localhost:9200/anagrafiche/email/_mapping?pretty=true'
{
  "email" : {
    "properties" : {
      "addTimestamp" : {
        "type" : "string"
      },
      "answered" : {
        "type" : "boolean"
      },
      "attacheddocument" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "attacheddocument" : {
            "type" : "string"
          },
          "author" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string"
          },
          "name" : {
            "type" : "string"
          },
          "date" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "keywords" : {
            "type" : "string"
          },
          "content_type" : {
            "type" : "string"
          }
        }
      },
      "cgateId" : {
        "type" : "string"
      },
      "contents" : {
        "type" : "string"
      },
      "date" : {
        "type" : "date",
        "format" : "dateOptionalTime",
        "include_in_all" : true
      },
      "filePath" : {
        "type" : "string"
      },
      "from" : {
        "properties" : {
          "address" : {
            "type" : "string"
          },
          "encodedPersonal" : {
            "type" : "string",
            "include_in_all" : true
          }
        }
      },
      "hasattachments" : {
        "type" : "boolean"
      },
      "numlines" : {
        "type" : "long"
      },
      "recipient" : {
        "properties" : {
          "address" : {
            "type" : "string"
          },
          "encodedPersonal" : {
            "type" : "string",
            "include_in_all" : true
          }
        }
      },
      "seen" : {
        "type" : "boolean"
      },
      "subject" : {
        "type" : "string"
      }
    }
  }
}

Here's the output of the indexing attempt:
[maxper@max ~]$ curl -XPOST 'http://localhost:9200/anagrafiche/email/' -d @testindex.json
{"error":"MapperParsingException[failed to parse]; nested:
JsonParseException[Failed to decode VALUE_STRING as base64
(MIME-NO-LINEFEEDS): Unexpected padding character ('=') as character #3 of
4-char base64 unit: padding only legal as 3rd or 4th character\n at
[Source: [B@45387c9d; line: 1, column: 32804]]; ","status":400}
[maxper@max ~]$
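
The error reports a position (column 32804) in the JSON source, which falls inside the Base64 value. One way to narrow the problem down is to scan the encoded string before sending it. The following is a minimal sketch, assuming the attachment value is meant to be a single-line Base64 string in the standard alphabet; `Base64Check` and `firstInvalidIndex` are hypothetical names for illustration and are not part of Elasticsearch or Jackson. It is a simplified check, not a full decoder.

```java
final class Base64Check {
    // Standard Base64 alphabet (assumed; URL-safe variants use '-' and '_' instead).
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /**
     * Returns the index of the first character that a strict, single-line
     * Base64 decoder would likely reject, or -1 if none is found.
     * If every character is acceptable but the total length is not a
     * multiple of 4, the string length is returned.
     */
    static int firstInvalidIndex(String s) {
        int n = s.length();
        // Up to two trailing '=' characters count as legal padding.
        int pad = 0;
        while (pad < 2 && pad < n && s.charAt(n - 1 - pad) == '=') {
            pad++;
        }
        int dataEnd = n - pad;
        for (int i = 0; i < dataEnd; i++) {
            // Rejects '=' before the end, CR, LF, spaces and anything else
            // outside the alphabet.
            if (ALPHABET.indexOf(s.charAt(i)) < 0) {
                return i;
            }
        }
        return (n % 4 != 0) ? n : -1;
    }

    public static void main(String[] args) {
        // Two encoded chunks glued together: padding ends up mid-string.
        System.out.println(firstInvalidIndex("QQ==QUJD"));
        // A line break inside the value.
        System.out.println(firstInvalidIndex("QUJD\r\nQUJD"));
        // A clean value.
        System.out.println(firstInvalidIndex("QUJD"));
    }
}
```

If the check reports a position, looking at the characters around it in the generated JSON should show whether padding from separately encoded chunks or a stray line break ended up in the middle of the value.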

Just to be clear, I really can index some documents, so the mapping should
be correct.

I hope someone can help me.
Thanks, Massimiliano

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

raymond_giorgi · November 19, 2014, 7:22pm

Bump on this. I've tried using the built-in Java Base64 encoding:
Base64.getMimeEncoder().encode(Files.readAllBytes(file.toPath()));

And using Jackson's ObjectMapper as follows:
String base64 = mapper.writeValueAsString(Files.readAllBytes(file.toPath()));
base64 = base64.substring(1, base64.length() - 1);

Can anyone help out and point me to where I'm going wrong?
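
One thing worth checking in the first attempt: `Base64.getMimeEncoder()` wraps its output at 76 characters with CRLF line separators, while `Base64.getEncoder()` produces a single unbroken line. The error message in this thread names the MIME-NO-LINEFEEDS variant, which suggests the decoder expects a value without line breaks; whether line breaks are the actual cause here is not confirmed in this thread. A minimal sketch of the difference, using only `java.util.Base64` (the class name `Base64LineBreakDemo` and the 100-byte stand-in payload are assumptions for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64LineBreakDemo {
    /** MIME encoder: output is wrapped at 76 characters using CRLF. */
    static String encodeMime(byte[] data) {
        return new String(Base64.getMimeEncoder().encode(data), StandardCharsets.US_ASCII);
    }

    /** Basic encoder: output is one line with no separators. */
    static String encodeBasic(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Stand-in for Files.readAllBytes(file.toPath()); long enough to force a wrap.
        byte[] data = new byte[100];

        String mime = encodeMime(data);
        String basic = encodeBasic(data);

        System.out.println("MIME output contains CRLF:  " + mime.contains("\r\n"));
        System.out.println("Basic output contains CRLF: " + basic.contains("\r\n"));
        // Stripping the separators from the MIME output yields the basic output.
        System.out.println("Equal after stripping CRLF: "
            + mime.replace("\r\n", "").equals(basic));
    }
}
```

If the receiving side does expect a single-line value, switching to `Base64.getEncoder().encodeToString(...)` would be the smaller change; stripping the separators from already encoded text is the alternative when the encoder cannot be changed.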

raymond_giorgi · November 19, 2014, 7:38pm

I should also mention that I'm trying to use the RabbitMQ river, which is
why I'm converting files into Base64 to begin with.

Thanks again!
