There are several different aspects when it comes to streaming files to
Elasticsearch.
- The first is that the input format is JSON and the indexer should create
large JSON docs, of which only small snippets are displayed later
(highlighting).
- The second is that you want binary files stored inside Lucene, unindexed
(for whatever reason). The main challenge is how to handle this with
JSON (it requires something like base64 encoding in combination with
compression).
- And the third is that you want Elasticsearch to do smart Lucene index
codec processing to create documents from streams "on the fly".
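To make the second point concrete, here is a minimal sketch of the "base64 in combination with compression" idea, using only the Python standard library. The field names `filename` and `payload` and the helper function names are my own assumptions for illustration; nothing here talks to Elasticsearch.

```python
import base64
import json
import zlib


def encode_binary_field(data: bytes) -> str:
    """Compress, then base64-encode, so the bytes fit inside a JSON string."""
    return base64.b64encode(zlib.compress(data)).decode("ascii")


def decode_binary_field(encoded: str) -> bytes:
    """Reverse of encode_binary_field: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(encoded))


# Hypothetical document layout; "payload" is the field you would want
# stored but not indexed.
payload = b"example binary content " * 1000
doc = json.dumps({"filename": "report.bin",
                  "payload": encode_binary_field(payload)})
restored = decode_binary_field(json.loads(doc)["payload"])
```

Note that the whole file still has to sit in memory to build the JSON string, and base64 adds roughly a third on top of the compressed size, so this only helps if the content compresses well.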
-> 1. I think there is no advantage in streaming for this case, since
Lucene needs the whole document in memory somewhere to compute the
inverted index statistics. If your input documents are large, just
add enough heap memory to get them processed. I'm not sure, but most
search engines out there, including Google, have a hard limit on how much
of a document is analyzed for highlighting or term indexing (the first 10k
characters, maybe?). The reason is better performance. So maybe it makes
little sense to force these features onto Elasticsearch with large docs.
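If you wanted to apply such a limit yourself before indexing, a rough sketch could look like the following. The 10,000 character default is just my guess from above, not a documented limit of any search engine, and the function name is hypothetical.

```python
def truncate_for_analysis(text: str, max_chars: int = 10_000) -> str:
    """Return at most max_chars characters, cutting at a space if one exists."""
    if len(text) <= max_chars:
        return text
    # Look for the last space inside the allowed prefix to avoid splitting a word.
    cut = text.rfind(" ", 0, max_chars)
    return text[:cut] if cut > 0 else text[:max_chars]
```

The truncated text would go into the field that is analyzed and highlighted, while the full document stays wherever you keep the original.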
-> 2. You should think twice about this; it may be better to store
these files outside Elasticsearch. Retrieving large stored docs may hurt
your performance, and relocating shards will be slow.
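A minimal sketch of the "store it outside" alternative, assuming a plain directory as the external store: the file is copied there under its SHA-256 hash, and only a small metadata document would be sent to Elasticsearch. The function name, the directory layout, and the metadata field names are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def store_externally(source: Path, store_dir: Path) -> dict:
    """Copy source into store_dir under its SHA-256 name; return metadata to index."""
    digest = hashlib.sha256()
    with source.open("rb") as f:
        # Hash in 1 MiB chunks so a large file is never fully loaded into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    target = store_dir / digest.hexdigest()
    if not target.exists():
        shutil.copyfile(source, target)
    return {
        "filename": source.name,
        "size_bytes": source.stat().st_size,
        "sha256": digest.hexdigest(),
        "location": str(target),
    }
```

The returned dict is small no matter how large the file is, so retrieving hits and relocating shards stay cheap; the price is that you have to keep the external store and the index in sync yourself.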
-> 3. You can implement a custom codec for Lucene 4 that may handle
your content streams gracefully. Unfortunately, such a codec is
domain-specific, since it depends on the nature of the stream elements.
For example, such a codec could do complex event stream processing (CEP)
like in Esper: http://esper.codehaus.org/
My 2c.
Jörg
On 06.05.13 21:34, Yermakovich Siarhei wrote:
Has anything changed since? Maybe some third-party plugins exist that
allow streaming large files to Elasticsearch?