Would it be possible to dedicate certain ElasticSearch client nodes to doing only
the analysis via the mapper-attachment plugin?
Afterwards the indexing should be performed on the data nodes.
The goal would be to offload the nodes holding the indexes, since analyzing a
lot of large documents consumes a lot of resources.
Any thoughts or experiences will be very much appreciated.
I think you'd better do this in your own process, outside an Elasticsearch node.
You don't need mapper attachment: you can use Tika directly if you're a Java developer, or any other library that extracts content and metadata.
In fact, I moved FSRiver from mapper attachment to Tika directly. Now I have fine control over my documents.
Better still, I'm no longer forced to send over the wire a full PDF document (10 MB) that consists mainly of pictures, only to extract a small amount of data (metadata, for example).
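For anyone wanting to follow this approach, here is a minimal sketch of extracting text with Tika in your own process before indexing. It assumes the `tika-parsers` dependency is on the classpath; the class name, the `extractText` helper, and the `report.pdf` file name are illustrative, not part of any existing codebase:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractAndIndex {

    /** Extract plain text from any document Tika can detect (PDF, Word, ...). */
    static String extractText(Path path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // Cap extraction at ~1M characters so a single huge document
        // cannot exhaust the heap of this worker process.
        BodyContentHandler handler = new BodyContentHandler(1_000_000);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(path)) {
            parser.parse(stream, handler, metadata);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = extractText(Paths.get("report.pdf"));
        // Index only the extracted text (e.g. via the Elasticsearch HTTP API),
        // instead of shipping the multi-megabyte binary to the cluster.
        System.out.println("Extracted " + text.length() + " characters");
    }
}
```

Because the extraction runs in a separate JVM, an out-of-memory crash here only kills the worker, never an Elasticsearch node.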
Ah! Causing an out-of-memory exception on a node is definitely not a best practice!
That's one of the reasons I would not put Tika in the nodes directly.
One of my TODO items is to move FSRiver to Logstash. Extraction will then be done by Logstash (probably using Tika), but in a separate process from Elasticsearch.
Once it's in, it will be covered by support contracts.
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
Thanks for your answer!
It does make sense.
The reason for my question was: I wanted to take advantage of the mapper-attachment plugin,
and on the other hand of the high-availability and scalability features of ElasticSearch nodes,
to perform the analysis of the documents. E.g. we have seen situations where Tika was
eating all the memory of a machine and eventually died…
My thought was that ElasticSearch could detect these situations and remove the affected node
from the cluster.
I have no trouble with development, but if we can use available software for which we have
a support contract, then I prefer that.
On Wednesday, February 19, 2014 7:42:49 AM UTC+1, David Pilato wrote:
That looks promising! I hope you have a short TODO list.
Thanks again,
Jan