Customized attachment mapper plugin for streaming, Smile, and extra metadata support

Not sure if I should post this here or on GitHub.

Thanks for the work on the Mapper Attachments Plugin. I recently
created a customization for my work and would like to share it here.

The customization tries to:

  • Support end-to-end streaming without loading the whole JSON into memory
  • Support Smile
  • Support accepting images or other binary files for indexing.
    The existing implementation accepts only String and throws an Exception
    otherwise. The customized implementation now accepts any binary format
    and uses Tika for detection
  • Support customized fields (e.g. EXIF for images)
  • Calculate the file checksum for my use case
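On the checksum point: the JDK can compute it as a side effect of the same pass that feeds the content extractor, so no second read of the attachment is needed. This is only a sketch with a stand-in consumer loop, not the plugin's actual code; `DigestInputStream` and `MessageDigest` are standard JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class ChecksumSketch {

    // Wraps the attachment stream so the checksum is computed while the
    // consumer (Tika, in the plugin's case) reads the bytes; no extra pass.
    static String readAndChecksum(InputStream in, StringBuilder content)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        DigestInputStream dis = new DigestInputStream(in, md);
        int b;
        while ((b = dis.read()) != -1) {   // stand-in for the real consumer
            content.append((char) b);
        }
        StringBuilder hex = new StringBuilder();
        for (byte d : md.digest()) {
            hex.append(String.format("%02x", d));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder content = new StringBuilder();
        String sum = readAndChecksum(
                new ByteArrayInputStream("hello".getBytes()), content);
        System.out.println(content + " sha256=" + sum);
    }
}
```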

That is quite a lot, and some of it may not be applicable in general;
refactoring would be needed. I will put the things I would like to discuss
one by one below, and see if I can help by submitting any PRs.

#### Streaming
This is quite complicated.
I used a dirty hack to get the actual Jackson JsonParser and then apply
the streaming method JsonParser.readBinaryValue(OutputStream).


Then I pass it to Apache Commons IO's TeeOutputStream, which then uses
Java's pipes to buffer the content into the actual InputStream consumed by Tika.
I am not a big fan of using multiple threads in these cases, but it seems that
whenever a PipedInputStream is to be used, the problem is inherently a
multi-threading one.
Here we trade off duplicating buffer data across threads to keep the buffer
bounded in size;
depending on the processing, in general the consumer stream is still much
faster than the input stream (which is likely network IO).
Are there any streaming libraries out there that could handle this
problem better? Say, Google's Guava?
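To make the trade-off concrete, here is the pipe pattern stripped down to plain JDK classes (no Jackson, Tika, or commons-io): a producer thread stands in for `JsonParser.readBinaryValue(out)`, the consumer side stands in for Tika, and the pipe's bounded buffer is what caps memory use. This is an assumed sketch of the pattern, not the plugin code.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeSketch {

    // Bounded pipe: the producer blocks once the 8 KB buffer is full,
    // so memory stays limited regardless of attachment size.
    static byte[] pumpThroughPipe(byte[] source) throws Exception {
        PipedInputStream in = new PipedInputStream(8 * 1024);
        PipedOutputStream out = new PipedOutputStream(in);

        // Producer thread: stands in for JsonParser.readBinaryValue(out).
        Thread producer = new Thread(() -> {
            try (out) {
                out.write(source);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer side: stands in for Tika reading the InputStream.
        byte[] result = in.readAllBytes();
        producer.join();
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "binary attachment bytes".getBytes();
        System.out.println(new String(pumpThroughPipe(data)));
    }
}
```

Because the producer blocks while the consumer catches up, this duplicates the bytes once through the pipe buffer, which is exactly the cost mentioned above.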

#### Decouple the index request from the file content?
I also did some study comparing Base64 and Smile indexing, where the latter
is more efficient.
That said, would there be any benefit if we could separate the JSON index
request from the actual file content, i.e. with multipart? That would
decouple the encoding, which makes Base64/gzip easier.

#### Smile
For Smile, in my scenario the content object being parsed is actually
XContentParser.Token.VALUE_EMBEDDED_OBJECT, not
XContentParser.Token.VALUE_STRING.

#### Accept empty parsed content
For some files, e.g. images, the parsed content is still empty;
however, the metadata retrieved by Tika is more useful. An exception
should not be thrown in such cases.

#### Tika type
It seems that field mappings currently cannot be added dynamically? I.e.
when new Tika metadata needs to be stored, that requires customization in
the plugin. Tika is quite powerful and supports lots of metadata, so I
wonder whether it would be possible to add a type "tika" like this, for example:

{
  "type": "tika",
  "metadatakey": "Image Height",
  "store": true
}

Then, when such a key exists in Tika's Metadata object, its value would be
populated and stored accordingly.
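To illustrate how such a type could behave, here is a sketch that uses a plain Map in place of Tika's Metadata object (whose real API is `metadata.names()` / `metadata.get(key)`); the mapping shape, field names, and `extractConfigured` helper are all hypothetical, not part of the plugin or of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TikaTypeSketch {

    // Hypothetical "tika" field mapping: metadata key -> store flag.
    // A plain Map stands in for org.apache.tika.metadata.Metadata here.
    static Map<String, String> extractConfigured(Map<String, String> tikaMetadata,
                                                 Map<String, Boolean> mapping) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> field : mapping.entrySet()) {
            String value = tikaMetadata.get(field.getKey());
            // Populate the field only when the key exists and store is true.
            if (value != null && Boolean.TRUE.equals(field.getValue())) {
                doc.put(field.getKey(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = Map.of("Image Height", "480 pixels",
                                          "Author", "someone");
        Map<String, Boolean> mapping = Map.of("Image Height", true);
        System.out.println(extractConfigured(meta, mapping));
    }
}
```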

Also, I found that when building with context.path().add(name), the stored
fields end up in dot-delimited paths, not in object format.

I.e. the result document has each key individually:
"file_attachment.author"
"file_attachment.keywords"
"file_attachment.name"

Can I group them into a nested object? E.g. with an image as an example:

file_attachment: {
    image_exif: {
        height: ...
    }
}
####Testing

  • Is ES 0.90.3 moving to JUnit, or should TestNG stay?

Quite a lot of stuff; I would be very happy to get feedback from you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.